[yt-dev] Zombie jobs on eudora?

Matthew Turk matthewturk at gmail.com
Wed Jun 11 07:25:07 PDT 2014


I should also note that at some point in the future I want to get rid
of the object arrays for grids, but that is on a longer timescale.  Using
John's grid tree is a much better approach.

On Wed, Jun 11, 2014 at 9:21 AM, Matthew Turk <matthewturk at gmail.com> wrote:
> On Tue, Jun 10, 2014 at 10:53 PM, Matthew Turk <matthewturk at gmail.com> wrote:
>> On Tue, Jun 10, 2014 at 10:50 PM, Nathan Goldbaum <nathan12343 at gmail.com> wrote:
>>> For the leaking YTArrays, Kacper suggested the following patch on IRC:
>>>
>>> http://bpaste.net/show/361120/
>>>
>>> This works for FLASH but seems to break field detection for Enzo.
>>
>> I don't think this will ever be a big memory hog, but it is worth fixing.
>>
>
> I've spent a bit of time on this again this morning, and
> everything seems to come back down to the issue of having a numpy
> array of grid objects.  If I switch this to a list, the reference
> counting is correct again and things get deallocated properly.  I've
> tried a number of ways of changing how they're allocated, but none
> seem to work for getting the refcount correct.  Oddly enough, if I
> track both a list *and* an array (i.e., set self._grids =
> self.grids.tolist()) then the refcounting is correct.
>
> I'm sure there's an explanation for this, but I don't know it.  It
> looks to me like numpy thinks it owns the data and that it should
> decrement the object refcount.
>
> By adding this line:
>
> self._grids = self.grids.tolist()
>
> after the call to _populate_grid_objects() in grid_geometry_handler, I
> was able to get all references tracked and removed.
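>
> For anyone who wants to poke at this, here is a toy version of the
> kind of check I have been doing; the Grid class below is just a
> stand-in, not one of our grid classes:
>
> import sys
> import weakref
> import numpy as np
>
> class Grid(object):
>     # Stand-in for a real grid object; only the refcounting matters.
>     pass
>
> grids = np.empty(4, dtype="object")   # object array, like index.grids
> for i in range(4):
>     grids[i] = Grid()
>
> watcher = weakref.ref(grids[0])
> print(sys.getrefcount(grids[0]))      # refcount while held by the array
>
> # Mirror of the workaround: keep a plain list alongside the object
> # array, i.e. self._grids = self.grids.tolist() after
> # _populate_grid_objects().
> _grids = grids.tolist()
>
> del grids, _grids
> print(watcher() is None)              # True once the grid really goes away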
>
>>>
>>>
>>> On Tue, Jun 10, 2014 at 8:47 PM, Matthew Turk <matthewturk at gmail.com> wrote:
>>>>
>>>> Hi Nathan,
>>>>
>>>> I believe there are two things at work here.
>>>>
>>>> 1) (I do not have high confidence in this one.)  YTArrays that are
>>>> accessed via .d and turned into numpy arrays (which no longer *own
>>>> the data*) may be retaining a reference, and that reference doesn't
>>>> get freed later.  This happens often when we are doing things in the
>>>> hierarchy instantiation phase.  I haven't been able to figure out
>>>> which references get lost; for me, over 40 outputs, I lost 1560.  I
>>>> think it's 39 YTArrays per hierarchy.  This might also be related to
>>>> field detection.  I think this is not a substantial contributor.
>>>> 2) For some reason, when the .grids attribute (object array) is
>>>> deleted on an index, the refcounts of those grids don't decrease.  I
>>>> am able to decrease their refcounts by manually setting
>>>> pf.index.grids[:] = None.  This eliminated all retained grid
>>>> references.
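>>>>
>>>> As a stopgap, a driver loop can apply that by hand.  This is only a
>>>> sketch, and the filenames are placeholders:
>>>>
>>>> import gc
>>>> import yt
>>>>
>>>> for fn in ["DD0000/data0000", "DD0001/data0001"]:
>>>>     pf = yt.load(fn)
>>>>     pf.index                    # force the hierarchy to be built
>>>>     # ... analysis goes here ...
>>>>     pf.index.grids[:] = None    # drop the retained grid references
>>>>     del pf
>>>>     gc.collect()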
>>>>
>>>> So, I think the root is that at some point, because of circular
>>>> references or whatever, the finalizer isn't being called on the
>>>> GridIndex (or on Index itself).  This results in the reference to the
>>>> grids array being kept, which then pumps up the lost object count.  I
>>>> don't know why it's not getting called (it's not guaranteed to be
>>>> called, in any event).
>>>>
>>>> I have to take care of some other things (including Brendan's note
>>>> about the memory problems with particle datasets) but I am pretty sure
>>>> this is the root.
>>>>
>>>> -Matt
>>>>
>>>> On Tue, Jun 10, 2014 at 10:13 PM, Matthew Turk <matthewturk at gmail.com>
>>>> wrote:
>>>> > Hi Nathan,
>>>> >
>>>> > All it requires is a call to .index; you don't need to do anything
>>>> > else to get it to lose references.
>>>> >
>>>> > I'm still looking into it.
>>>> >
>>>> > -Matt
>>>> >
>>>> > On Tue, Jun 10, 2014 at 9:26 PM, Nathan Goldbaum <nathan12343 at gmail.com>
>>>> > wrote:
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Tue, Jun 10, 2014 at 10:59 AM, Matthew Turk <matthewturk at gmail.com>
>>>> >> wrote:
>>>> >>>
>>>> >>> Do you have a reproducible script?
>>>> >>
>>>> >>
>>>> >> This should do the trick: http://paste.yt-project.org/show/4767/
>>>> >>
>>>> >> (this is with an enzo dataset by the way)
>>>> >>
>>>> >> That script prints (on my machine):
>>>> >>
>>>> >> EnzoGrid    15065
>>>> >> YTArray     1520
>>>> >> list        704
>>>> >> dict        2
>>>> >> MaskedArray 1
>>>> >>
>>>> >> Which indicates that 15000 EnzoGrid objects and 1520 YTArray
>>>> >> objects have leaked.
>>>> >>
>>>> >> The list I'm printing out at the end of the script should be the
>>>> >> objects that leaked during the loop over the Enzo dataset.  The
>>>> >> objgraph.get_leaking_objects() function returns the list of all
>>>> >> objects being tracked by the garbage collector that have no
>>>> >> references but still have nonzero refcounts.
>>>> >>
>>>> >> This means the "original_leaks" list isn't necessarily a list of
>>>> >> leaky objects - most of the things in there are singletons that the
>>>> >> interpreter keeps around.  To create a list of the leaky objects
>>>> >> produced by iterating over the loop, I take the set difference of
>>>> >> the output of get_leaking_objects() before and after iterating over
>>>> >> the dataset.
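>>>> >>
>>>> >> The core of the script is roughly the following; the dataset path is
>>>> >> a placeholder and the details differ a bit from the paste:
>>>> >>
>>>> >> import gc
>>>> >> from collections import Counter
>>>> >>
>>>> >> import objgraph
>>>> >> import yt
>>>> >>
>>>> >> original_leaks = objgraph.get_leaking_objects()
>>>> >> original_ids = set(id(obj) for obj in original_leaks)
>>>> >>
>>>> >> for fn in ["DD0040/data0040"]:      # placeholder Enzo output(s)
>>>> >>     ds = yt.load(fn)
>>>> >>     ds.index                        # build the hierarchy
>>>> >>     del ds
>>>> >> gc.collect()
>>>> >>
>>>> >> # Anything leaking now that was not leaking before the loop is ours.
>>>> >> new_leaks = [obj for obj in objgraph.get_leaking_objects()
>>>> >>              if id(obj) not in original_ids]
>>>> >>
>>>> >> for name, count in Counter(type(obj).__name__
>>>> >>                            for obj in new_leaks).most_common():
>>>> >>     print(name, count)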
>>>> >>
>>>> >>>
>>>> >>> If you make a bunch of symlinks to
>>>> >>> one flash file and load them all in sequence, does that replicate the
>>>> >>> behavior?
>>>> >>
>>>> >>
>>>> >> Yes, it seems to.  Compare the output of this script:
>>>> >> http://paste.yt-project.org/show/4768/
>>>> >>
>>>> >> Adjust the range of the for loop from 0 to 5, creating symlinks to
>>>> >> WindTunnel/windtunnel_4lev_hdf5_plt_cnt_0040 as needed.
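>>>> >>
>>>> >> For reference, the symlinks can be created with something like this
>>>> >> (the link names are arbitrary):
>>>> >>
>>>> >> import os
>>>> >>
>>>> >> # Point several "different" filenames at the same FLASH output so
>>>> >> # each yt.load() call in the loop sees a fresh dataset.
>>>> >> src = "WindTunnel/windtunnel_4lev_hdf5_plt_cnt_0040"
>>>> >> for i in range(5):
>>>> >>     link = "windtunnel_copy_%04d" % i
>>>> >>     if not os.path.exists(link):
>>>> >>         os.symlink(src, link)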
>>>> >>
>>>> >>>
>>>> >>>
>>>> >>> On Tue, Jun 10, 2014 at 12:57 PM, Nathan Goldbaum
>>>> >>> <nathan12343 at gmail.com>
>>>> >>> wrote:
>>>> >>> >
>>>> >>> >
>>>> >>> >
>>>> >>> > On Tue, Jun 10, 2014 at 10:45 AM, Matthew Turk
>>>> >>> > <matthewturk at gmail.com>
>>>> >>> > wrote:
>>>> >>> >>
>>>> >>> >> Hi Nathan,
>>>> >>> >>
>>>> >>> >> On Tue, Jun 10, 2014 at 12:43 PM, Nathan Goldbaum
>>>> >>> >> <nathan12343 at gmail.com>
>>>> >>> >> wrote:
>>>> >>> >> >
>>>> >>> >> >
>>>> >>> >> >
>>>> >>> >> > On Tue, Jun 10, 2014 at 6:09 AM, Matthew Turk
>>>> >>> >> > <matthewturk at gmail.com>
>>>> >>> >> > wrote:
>>>> >>> >> >>
>>>> >>> >> >> Hi Nathan,
>>>> >>> >> >>
>>>> >>> >> >> On Mon, Jun 9, 2014 at 11:02 PM, Nathan Goldbaum
>>>> >>> >> >> <nathan12343 at gmail.com>
>>>> >>> >> >> wrote:
>>>> >>> >> >> > Hey all,
>>>> >>> >> >> >
>>>> >>> >> >> > I'm looking at a memory leak that Philip (cc'd) is seeing
>>>> >>> >> >> > when iterating over a long list of FLASH datasets.  Just as
>>>> >>> >> >> > an example of the type of behavior he is seeing - today he
>>>> >>> >> >> > left his script running and ended up consuming 300 GB of RAM
>>>> >>> >> >> > on a viz node.
>>>> >>> >> >> >
>>>> >>> >> >> > FWIW, the dataset is not particularly large - ~300 outputs
>>>> >>> >> >> > and ~100 MB per output.  These are also FLASH cylindrical
>>>> >>> >> >> > coordinate simulations - so perhaps this behavior will only
>>>> >>> >> >> > occur in curvilinear geometries?
>>>> >>> >> >>
>>>> >>> >> >> Hm, I don't know about that.
>>>> >>> >> >>
>>>> >>> >> >> >
>>>> >>> >> >> > I've been playing with objgraph to try to understand what's
>>>> >>> >> >> > happening.
>>>> >>> >> >> > Here's the script I've been using:
>>>> >>> >> >> > http://paste.yt-project.org/show/4762/
>>>> >>> >> >> >
>>>> >>> >> >> > Here's the output after one iteration of the for loop:
>>>> >>> >> >> > http://paste.yt-project.org/show/4761/
>>>> >>> >> >> >
>>>> >>> >> >> > It seems that for some reason a lot of data is not being
>>>> >>> >> >> > garbage collected.
>>>> >>> >> >> >
>>>> >>> >> >> > Could there be a reference counting bug somewhere down in a
>>>> >>> >> >> > Cython routine?
>>>> >>> >> >>
>>>> >>> >> >> Based on what you're running, the only Cython routines being
>>>> >>> >> >> called
>>>> >>> >> >> are likely in the selection system.
>>>> >>> >> >>
>>>> >>> >> >> > Objgraph is unable to find backreferences to root grid tiles
>>>> >>> >> >> > in the FLASH dataset, and all the other yt objects that I've
>>>> >>> >> >> > looked at seem to have backreference graphs that terminate at
>>>> >>> >> >> > a FLASHGrid object that represents a root grid tile in one of
>>>> >>> >> >> > the datasets.  That's the best guess I have - but definitely
>>>> >>> >> >> > nothing conclusive.  I'd appreciate any other ideas anyone
>>>> >>> >> >> > else has to help debug this.
>>>> >>> >> >>
>>>> >>> >> >> I'm not entirely sure how to parse the output you've pasted,
>>>> >>> >> >> but I do have a thought.  If you have a reproducible case, I
>>>> >>> >> >> can test it myself.  I am wondering if this could be related
>>>> >>> >> >> to the way that grid masks are cached.  You should be able to
>>>> >>> >> >> test this by adding this line to _get_selector_mask in
>>>> >>> >> >> grid_patch.py, just before "return mask"
>>>> >>> >> >>
>>>> >>> >> >> self._last_mask = self._last_selector_id = None
>>>> >>> >> >>
>>>> >>> >> >> Something like this patch:
>>>> >>> >> >>
>>>> >>> >> >> http://paste.yt-project.org/show/4316/
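>>>> >>> >> >>
>>>> >>> >> >> Schematically, the change looks like the following; this is a
>>>> >>> >> >> mock of the caching pattern, not the real grid_patch.py code:
>>>> >>> >> >>
>>>> >>> >> >> class GridPatchSketch(object):
>>>> >>> >> >>     # Mock of the mask caching in a grid patch, not the real class.
>>>> >>> >> >>     _last_mask = None
>>>> >>> >> >>     _last_selector_id = None
>>>> >>> >> >>
>>>> >>> >> >>     def _get_selector_mask(self, selector):
>>>> >>> >> >>         if id(selector) == self._last_selector_id:
>>>> >>> >> >>             mask = self._last_mask
>>>> >>> >> >>         else:
>>>> >>> >> >>             mask = selector.fill_mask(self)  # stand-in for the real work
>>>> >>> >> >>             self._last_mask = mask
>>>> >>> >> >>             self._last_selector_id = id(selector)
>>>> >>> >> >>         # The suggested addition: drop the cached mask right away so
>>>> >>> >> >>         # the grid does not hold on to it between selections.
>>>> >>> >> >>         self._last_mask = self._last_selector_id = None
>>>> >>> >> >>         return mask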
>>>> >>> >> >
>>>> >>> >> >
>>>> >>> >> > Thanks for the code!  I will look into this today.
>>>> >>> >> >
>>>> >>> >> > Sorry for not explaining the random terminal output I pasted
>>>> >>> >> > from objgraph :/
>>>> >>> >> >
>>>> >>> >> > It's a list of objects created after yt operates on one dataset
>>>> >>> >> > and after the garbage collector is explicitly called.  Each
>>>> >>> >> > iteration of the loop sees the creation of objects representing
>>>> >>> >> > the FLASH grids, hierarchy, and associated metadata.  With
>>>> >>> >> > enough iterations this overhead from previous loop iterations
>>>> >>> >> > begins to dominate the total memory budget.
>>>> >>> >>
>>>> >>> >> The code snippet I sent might help reduce it, but I think it
>>>> >>> >> speaks to a deeper problem in that somehow the FLASH stuff isn't
>>>> >>> >> being GC'd anywhere.  It really ought to be.
>>>> >>> >>
>>>> >>> >> Can you try also doing:
>>>> >>> >>
>>>> >>> >> yt.frontends.flash.FLASHDataset._skip_cache = True
>>>> >>> >
>>>> >>> >
>>>> >>> > No effect, unfortunately.
>>>> >>> >
>>>> >>> >>
>>>> >>> >> and seeing if that helps?
>>>> >>> >>
>>>> >>> >> >
>>>> >>> >> >>
>>>> >>> >> >>
>>>> >>> >> >>
>>>> >>> >> >> -Matt
>>>> >>> >> >>
>>>> >>> >> >> >
>>>> >>> >> >> > Thanks for your help in debugging this!
>>>> >>> >> >> >
>>>> >>> >> >> > -Nathan
>>>> >>> >> >> >


