[yt-dev] Zombie jobs on eudora?

Matthew Turk matthewturk at gmail.com
Tue Jun 10 20:47:23 PDT 2014


Hi Nathan,

I believe there are two things at work here.

1) (I do not have high confidence in this one.)  YTArrays that are
accessed via .d and turned into numpy arrays which no longer *own the
data* may be retaining a reference that never gets freed.  This
happens often during hierarchy instantiation.  I haven't been able to
figure out which references get lost; for me, over 40 outputs, I lost
1560, i.e. 39 YTArrays per hierarchy.  This might also be related to
field detection.  I think this is not a substantial contributor.
2) For some reason, when the .grids attribute (an object array) is
deleted on an index, the refcounts of those grids don't decrease.  I
was able to decrease their refcounts by manually setting
pf.index.grids[:] = None, which eliminated all retained grid
references (a sketch of this follows below).
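
In a driver loop, that looks something like this (an untested sketch;
"filenames" stands in for whatever list of outputs you're iterating
over):

import gc

import yt

for fn in filenames:
    pf = yt.load(fn)
    pf.index  # force hierarchy instantiation
    # ... analysis goes here ...
    pf.index.grids[:] = None  # manually break the retained grid references
    del pf
    gc.collect()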

So, I think the root cause is that at some point, because of circular
references or similar, the finalizer isn't being called on the
GridIndex (or on Index itself).  This keeps the reference to the
grids array alive, which then pumps up the lost object count.  I
don't know why it's not getting called (it's not guaranteed to be
called, in any event).
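
As background, here's a minimal, yt-free illustration of the failure
mode: on the Python 2 interpreters we run on, an object whose class
defines __del__ and which sits in a reference cycle is never
collected, so its finalizer never runs (the cycle lands in gc.garbage):

import gc

class Node(object):
    def __del__(self):
        print("finalizer called")

a, b = Node(), Node()
a.other, b.other = b, a  # build a reference cycle between finalizable objects
del a, b
gc.collect()
print(gc.garbage)  # the two Nodes end up here; neither __del__ ever ran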

I have to take care of some other things (including Brendan's note
about the memory problems with particle datasets), but I am pretty
sure this is the root cause.

-Matt

On Tue, Jun 10, 2014 at 10:13 PM, Matthew Turk <matthewturk at gmail.com> wrote:
> Hi Nathan,
>
> All it requires is a call to .index; you don't need to do anything
> else to get it to lose references.
>
> I'm still looking into it.
>
> -Matt
>
> On Tue, Jun 10, 2014 at 9:26 PM, Nathan Goldbaum <nathan12343 at gmail.com> wrote:
>>
>>
>>
>> On Tue, Jun 10, 2014 at 10:59 AM, Matthew Turk <matthewturk at gmail.com>
>> wrote:
>>>
>>> Do you have a reproducible script?
>>
>>
>> This should do the trick: http://paste.yt-project.org/show/4767/
>>
>> (this is with an Enzo dataset, by the way)
>>
>> That script prints (on my machine):
>>
>> EnzoGrid    15065
>> YTArray     1520
>> list        704
>> dict        2
>> MaskedArray 1
>>
>> This indicates that 15065 EnzoGrid objects and 1520 YTArray objects
>> have leaked.
>>
>> The list I'm printing out at the end of the script should be the objects
>> that leaked during the loop over the Enzo dataset.  The
>> objgraph.get_leaking_objects() function returns the list of all objects
>> tracked by the garbage collector that have no known referrers but still
>> have nonzero refcounts.
>>
>> This means the "original_leaks" list isn't necessarily a list of leaky
>> objects - most of the things in it are singletons that the interpreter
>> keeps around.  To isolate the objects leaked by the loop itself, I take
>> the set difference of the output of get_leaking_objects() before and
>> after iterating over the dataset.
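>>
>> In outline, the technique looks like this (a sketch; "filenames" and
>> the analysis calls are stand-ins for what's actually in the paste):
>>
>> import gc
>> from collections import Counter
>>
>> import objgraph
>> import yt
>>
>> gc.collect()
>> before = set(id(o) for o in objgraph.get_leaking_objects())
>>
>> for fn in filenames:
>>     ds = yt.load(fn)
>>     ds.index  # instantiate the hierarchy
>>
>> gc.collect()
>> new_leaks = [o for o in objgraph.get_leaking_objects()
>>              if id(o) not in before]
>> # Tally the leaked types, like the EnzoGrid/YTArray counts above.
>> print(Counter(type(o).__name__ for o in new_leaks))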
>>
>>>
>>> If you make a bunch of symlinks to
>>> one FLASH file and load them all in sequence, does that replicate the
>>> behavior?
>>
>>
>> Yes, it seems to.  Compare the output of this script:
>> http://paste.yt-project.org/show/4768/
>>
>> Adjust the range of the for loop from 0 to 5, creating symlinks to
>> WindTunnel/windtunnel_4lev_hdf5_plt_cnt_0040 as needed.
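>>
>> For example, something like this (the copy names are hypothetical):
>>
>> import os
>>
>> src = 'WindTunnel/windtunnel_4lev_hdf5_plt_cnt_0040'
>> for i in range(5):
>>     os.symlink(src, 'windtunnel_copy_%04i' % i)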
>>
>>>
>>>
>>> On Tue, Jun 10, 2014 at 12:57 PM, Nathan Goldbaum <nathan12343 at gmail.com>
>>> wrote:
>>> >
>>> >
>>> >
>>> > On Tue, Jun 10, 2014 at 10:45 AM, Matthew Turk <matthewturk at gmail.com>
>>> > wrote:
>>> >>
>>> >> Hi Nathan,
>>> >>
>>> >> On Tue, Jun 10, 2014 at 12:43 PM, Nathan Goldbaum
>>> >> <nathan12343 at gmail.com>
>>> >> wrote:
>>> >> >
>>> >> >
>>> >> >
>>> >> > On Tue, Jun 10, 2014 at 6:09 AM, Matthew Turk <matthewturk at gmail.com>
>>> >> > wrote:
>>> >> >>
>>> >> >> Hi Nathan,
>>> >> >>
>>> >> >> On Mon, Jun 9, 2014 at 11:02 PM, Nathan Goldbaum
>>> >> >> <nathan12343 at gmail.com>
>>> >> >> wrote:
>>> >> >> > Hey all,
>>> >> >> >
>>> >> >> > I'm looking at a memory leak that Philip (cc'd) is seeing when
>>> >> >> > iterating over a long list of FLASH datasets.  Just as an example
>>> >> >> > of the type of behavior he is seeing: today he left his script
>>> >> >> > running and ended up consuming 300 GB of RAM on a viz node.
>>> >> >> >
>>> >> >> > FWIW, the dataset is not particularly large: ~300 outputs and
>>> >> >> > ~100 MB per output.  These are also FLASH cylindrical coordinate
>>> >> >> > simulations, so perhaps this behavior will only occur in
>>> >> >> > curvilinear geometries?
>>> >> >>
>>> >> >> Hm, I don't know about that.
>>> >> >>
>>> >> >> >
>>> >> >> > I've been playing with objgraph to try to understand what's
>>> >> >> > happening.
>>> >> >> > Here's the script I've been using:
>>> >> >> > http://paste.yt-project.org/show/4762/
>>> >> >> >
>>> >> >> > Here's the output after one iteration of the for loop:
>>> >> >> > http://paste.yt-project.org/show/4761/
>>> >> >> >
>>> >> >> > It seems that for some reason a lot of data is not being garbage
>>> >> >> > collected.
>>> >> >> >
>>> >> >> > Could there be a reference counting bug somewhere down in a Cython
>>> >> >> > routine?
>>> >> >>
>>> >> >> Based on what you're running, the only Cython routines being called
>>> >> >> are likely in the selection system.
>>> >> >>
>>> >> >> > Objgraph is unable to find backreferences to root grid tiles in
>>> >> >> > the FLASH dataset, and all the other yt objects that I've looked
>>> >> >> > at seem to have backreference graphs that terminate at a FLASHGrid
>>> >> >> > object representing a root grid tile in one of the datasets.
>>> >> >> > That's the best guess I have, but definitely nothing conclusive.
>>> >> >> > I'd appreciate any other ideas anyone else has to help debug this.
>>> >> >>
>>> >> >> I'm not entirely sure how to parse the output you've pasted, but I
>>> >> >> do have a thought.  If you have a reproducible case, I can test it
>>> >> >> myself.  I am wondering if this could be related to the way that
>>> >> >> grid masks are cached.  You should be able to test this by adding
>>> >> >> this line to _get_selector_mask in grid_patch.py, just before
>>> >> >> "return mask":
>>> >> >>
>>> >> >> self._last_mask = self._last_selector_id = None
>>> >> >>
>>> >> >> Something like this patch:
>>> >> >>
>>> >> >> http://paste.yt-project.org/show/4316/
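>>> >> >>
>>> >> >> With that line added, the method would look roughly like this (a
>>> >> >> sketch from memory; see the paste for the actual diff):
>>> >> >>
>>> >> >> def _get_selector_mask(self, selector):
>>> >> >>     if hash(selector) == self._last_selector_id:
>>> >> >>         mask = self._last_mask  # reuse the cached mask
>>> >> >>     else:
>>> >> >>         self._last_mask = mask = selector.fill_mask(self)
>>> >> >>         self._last_selector_id = hash(selector)
>>> >> >>     # The suggested change: clear the cache immediately so it can't
>>> >> >>     # keep the mask (and through it, the grid) alive across datasets.
>>> >> >>     self._last_mask = self._last_selector_id = None
>>> >> >>     return mask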
>>> >> >
>>> >> >
>>> >> > Thanks for the code!  I will look into this today.
>>> >> >
>>> >> > Sorry for not explaining the random terminal output I pasted from
>>> >> > objgraph :/
>>> >> >
>>> >> > It's a list of objects created after yt operates on one dataset and
>>> >> > after the garbage collector is explicitly called.  Each iteration of
>>> >> > the loop sees the creation of objects representing the FLASH grids,
>>> >> > hierarchy, and associated metadata.  With enough iterations this
>>> >> > overhead from previous loop iterations begins to dominate the total
>>> >> > memory budget.
>>> >>
>>> >> The code snippet I sent might help reduce it, but I think it speaks to
>>> >> a deeper problem in that somehow the FLASH stuff isn't being GC'd
>>> >> anywhere.  It really ought to be.
>>> >>
>>> >> Can you try also doing:
>>> >>
>>> >> yt.frontends.flash.FLASHDataset._skip_cache = True
>>> >
>>> >
>>> > No effect, unfortunately.
>>> >
>>> >>
>>> >> and seeing if that helps?
>>> >>
>>> >> >
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> -Matt
>>> >> >>
>>> >> >> >
>>> >> >> > Thanks for your help in debugging this!
>>> >> >> >
>>> >> >> > -Nathan
>>> >> >> >