[yt-dev] Zombie jobs on eudora?

Matthew Turk matthewturk at gmail.com
Tue Jun 10 20:53:37 PDT 2014


On Tue, Jun 10, 2014 at 10:50 PM, Nathan Goldbaum <nathan12343 at gmail.com> wrote:
> For the leaking YTArrays, Kacper suggested the following patch on IRC:
>
> http://bpaste.net/show/361120/
>
> This works for FLASH but seems to break field detection for Enzo.

I don't think this will ever be a big memory hog, but it is worth fixing.

>
>
> On Tue, Jun 10, 2014 at 8:47 PM, Matthew Turk <matthewturk at gmail.com> wrote:
>>
>> Hi Nathan,
>>
>> I believe there are two things at work here.
>>
>> 1) (I do not have high confidence in this one.)  YTArrays that are
>> accessed via .d, which yields numpy arrays that no longer *own the
>> data*, may be retaining a reference that never gets freed.  This
>> happens often when we are doing things in the
>> hierarchy instantiation phase.  I haven't been able to figure out
>> which references get lost; for me, over 40 outputs, I lost 1560.  I
>> think it's 39 YTArrays per hierarchy.  This might also be related to
>> field detection.  I think this is not a substantial contributor.
>> 2) For some reason, when the .grids attribute (object array) is
>> deleted on an index, the refcounts of those grids don't decrease.  I
>> am able to decrease their refcounts by manually setting
>> pf.index.grids[:] = None.  This eliminated all retained grid
>> references.
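>>
>> As a quick illustration of (1) -- a sketch, with the array contents
>> made up -- .d hands back a plain ndarray view that does not own its
>> buffer, so it keeps the parent YTArray's memory alive through .base:
>>
>> from yt.units.yt_array import YTArray
>> a = YTArray([1.0, 2.0, 3.0], "g")
>> v = a.d                  # plain ndarray view into the same buffer
>> print(v.flags.owndata)   # False; v.base keeps a's buffer alive
>>
>> And a sketch of the check in (2); the dataset path is a placeholder:
>>
>> import gc
>> import sys
>> import yt
>>
>> pf = yt.load("WindTunnel/windtunnel_4lev_hdf5_plt_cnt_0040")
>> pf.index                       # force hierarchy instantiation
>> grid = pf.index.grids[0]
>> print(sys.getrefcount(grid))   # baseline refcount
>> pf.index.grids[:] = None       # manually clear the object array
>> gc.collect()
>> print(sys.getrefcount(grid))   # should drop once the slot is cleared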
>>
>> So, I think the root is that at some point, because of circular
>> references or whatever, the finalizer isn't being called on the
>> GridIndex (or on Index itself).  This results in the reference to the
>> grids array being kept, which then pumps up the lost object count.  I
>> don't know why it's not getting called (it's not guaranteed to be
>> called, in any event).
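>>
>> One way to check that hypothesis (a sketch; under Python 2, a cycle
>> that contains an object defining __del__ is never freed -- the
>> collector parks the whole cycle in gc.garbage instead):
>>
>> import gc
>> gc.collect()
>> # If the index (or the grids object array) shows up here, a __del__
>> # somewhere in the cycle is blocking collection.
>> print([type(o).__name__ for o in gc.garbage])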
>>
>> I have to take care of some other things (including Brendan's note
>> about the memory problems with particle datasets) but I am pretty sure
>> this is the root.
>>
>> -Matt
>>
>> On Tue, Jun 10, 2014 at 10:13 PM, Matthew Turk <matthewturk at gmail.com>
>> wrote:
>> > Hi Nathan,
>> >
>> > All it requires is a call to .index; you don't need to do anything
>> > else to get it to lose references.
>> >
>> > I'm still looking into it.
>> >
>> > -Matt
>> >
>> > On Tue, Jun 10, 2014 at 9:26 PM, Nathan Goldbaum <nathan12343 at gmail.com>
>> > wrote:
>> >>
>> >>
>> >>
>> >> On Tue, Jun 10, 2014 at 10:59 AM, Matthew Turk <matthewturk at gmail.com>
>> >> wrote:
>> >>>
>> >>> Do you have a reproducible script?
>> >>
>> >>
>> >> This should do the trick: http://paste.yt-project.org/show/4767/
>> >>
>> >> (this is with an Enzo dataset, by the way)
>> >>
>> >> That script prints (on my machine):
>> >>
>> >> EnzoGrid    15065
>> >> YTArray     1520
>> >> list        704
>> >> dict        2
>> >> MaskedArray 1
>> >>
>> >> This indicates that 15065 EnzoGrid objects and 1520 YTArray objects
>> >> have leaked.
>> >>
>> >> The list I'm printing out at the end of the script should be the
>> >> objects that leaked during the loop over the Enzo dataset.  The
>> >> objgraph.get_leaking_objects() function returns the list of all
>> >> objects tracked by the garbage collector that have no known
>> >> referrers but still have nonzero refcounts.
>> >>
>> >> This means the "original_leaks" list isn't necessarily a list of
>> >> leaky objects - most of the things in there are singletons that the
>> >> interpreter keeps around.  To isolate the leaky objects the loop
>> >> produces, I take the set difference of the output of
>> >> get_leaking_objects() before and after iterating over the dataset.
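>> >>
>> >> Roughly, that before/after comparison looks like this (a sketch; the
>> >> dataset path is a placeholder, and comparing by id() carries the
>> >> usual caveat that an id can be reused once its object is freed):
>> >>
>> >> import gc
>> >> import objgraph
>> >> import yt
>> >>
>> >> gc.collect()
>> >> before = set(id(o) for o in objgraph.get_leaking_objects())
>> >>
>> >> pf = yt.load("DD0040/DD0040")   # placeholder Enzo output
>> >> pf.index                        # force hierarchy instantiation
>> >> del pf
>> >> gc.collect()
>> >>
>> >> after = objgraph.get_leaking_objects()
>> >> leaks = [o for o in after if id(o) not in before]
>> >> objgraph.show_most_common_types(objects=leaks)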
>> >>
>> >>>
>> >>> If you make a bunch of symlinks to one FLASH file and load them
>> >>> all in sequence, does that replicate the behavior?
>> >>
>> >>
>> >> Yes, it seems to.  Compare the output of this script:
>> >> http://paste.yt-project.org/show/4768/
>> >>
>> >> Adjust the range of the for loop from 0 to 5, first creating the
>> >> symlinks to WindTunnel/windtunnel_4lev_hdf5_plt_cnt_0040 that the
>> >> script expects.
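>> >>
>> >> Creating the symlinks could look something like this (a sketch; the
>> >> copy names are placeholders):
>> >>
>> >> import os
>> >>
>> >> src = "windtunnel_4lev_hdf5_plt_cnt_0040"
>> >> for i in range(5):
>> >>     dst = os.path.join("WindTunnel", "copy_%04d" % i)
>> >>     if not os.path.lexists(dst):
>> >>         os.symlink(src, dst)   # relative target, same directory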
>> >>
>> >>>
>> >>>
>> >>> On Tue, Jun 10, 2014 at 12:57 PM, Nathan Goldbaum
>> >>> <nathan12343 at gmail.com>
>> >>> wrote:
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Tue, Jun 10, 2014 at 10:45 AM, Matthew Turk
>> >>> > <matthewturk at gmail.com>
>> >>> > wrote:
>> >>> >>
>> >>> >> Hi Nathan,
>> >>> >>
>> >>> >> On Tue, Jun 10, 2014 at 12:43 PM, Nathan Goldbaum
>> >>> >> <nathan12343 at gmail.com>
>> >>> >> wrote:
>> >>> >> >
>> >>> >> >
>> >>> >> >
>> >>> >> > On Tue, Jun 10, 2014 at 6:09 AM, Matthew Turk
>> >>> >> > <matthewturk at gmail.com>
>> >>> >> > wrote:
>> >>> >> >>
>> >>> >> >> Hi Nathan,
>> >>> >> >>
>> >>> >> >> On Mon, Jun 9, 2014 at 11:02 PM, Nathan Goldbaum
>> >>> >> >> <nathan12343 at gmail.com>
>> >>> >> >> wrote:
>> >>> >> >> > Hey all,
>> >>> >> >> >
>> >>> >> >> > I'm looking at a memory leak that Philip (cc'd) is seeing
>> >>> >> >> > when iterating over a long list of FLASH datasets.  Just as
>> >>> >> >> > an example of the type of behavior he is seeing - today he
>> >>> >> >> > left his script running and ended up consuming 300 GB of
>> >>> >> >> > RAM on a viz node.
>> >>> >> >> >
>> >>> >> >> > FWIW, the dataset is not particularly large - ~300 outputs
>> >>> >> >> > and ~100 MB per output.  These are also FLASH cylindrical
>> >>> >> >> > coordinate simulations - so perhaps this behavior will only
>> >>> >> >> > occur in curvilinear geometries?
>> >>> >> >>
>> >>> >> >> Hm, I don't know about that.
>> >>> >> >>
>> >>> >> >> >
>> >>> >> >> > I've been playing with objgraph to try to understand what's
>> >>> >> >> > happening.
>> >>> >> >> > Here's the script I've been using:
>> >>> >> >> > http://paste.yt-project.org/show/4762/
>> >>> >> >> >
>> >>> >> >> > Here's the output after one iteration of the for loop:
>> >>> >> >> > http://paste.yt-project.org/show/4761/
>> >>> >> >> >
>> >>> >> >> > It seems that for some reason a lot of data is not being
>> >>> >> >> > garbage collected.
>> >>> >> >> >
>> >>> >> >> > Could there be a reference counting bug somewhere down in
>> >>> >> >> > a Cython routine?
>> >>> >> >>
>> >>> >> >> Based on what you're running, the only Cython routines being
>> >>> >> >> called are likely in the selection system.
>> >>> >> >>
>> >>> >> >> > Objgraph is unable to find backreferences to root grid
>> >>> >> >> > tiles in the FLASH dataset, and all the other yt objects
>> >>> >> >> > that I've looked at seem to have backreference graphs that
>> >>> >> >> > terminate at a FLASHGrid object that represents a root grid
>> >>> >> >> > tile in one of the datasets.  That's the best guess I have
>> >>> >> >> > - but definitely nothing conclusive.  I'd appreciate any
>> >>> >> >> > other ideas anyone else has to help debug this.
>> >>> >> >>
>> >>> >> >> I'm not entirely sure how to parse the output you've pasted,
>> >>> >> >> but I do have a thought.  If you have a reproducible case, I
>> >>> >> >> can test it myself.  I am wondering if this could be related
>> >>> >> >> to the way that grid masks are cached.  You should be able to
>> >>> >> >> test this by adding this line to _get_selector_mask in
>> >>> >> >> grid_patch.py, just before "return mask":
>> >>> >> >>
>> >>> >> >> self._last_mask = self._last_selector_id = None
>> >>> >> >>
>> >>> >> >> Something like this patch:
>> >>> >> >>
>> >>> >> >> http://paste.yt-project.org/show/4316/
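>> >>> >> >>
>> >>> >> >> The patched method would look roughly like this (the
>> >>> >> >> surrounding code is paraphrased, not copied verbatim from
>> >>> >> >> grid_patch.py):
>> >>> >> >>
>> >>> >> >> def _get_selector_mask(self, selector):
>> >>> >> >>     if id(selector) == self._last_selector_id:
>> >>> >> >>         mask = self._last_mask
>> >>> >> >>     else:
>> >>> >> >>         mask = selector.fill_mask(self)
>> >>> >> >>         self._last_mask = mask
>> >>> >> >>         self._last_selector_id = id(selector)
>> >>> >> >>     # drop the cached mask right away so nothing retains it
>> >>> >> >>     self._last_mask = self._last_selector_id = None
>> >>> >> >>     return mask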
>> >>> >> >
>> >>> >> >
>> >>> >> > Thanks for the code!  I will look into this today.
>> >>> >> >
>> >>> >> > Sorry for not explaining the random terminal output I pasted
>> >>> >> > from objgraph :/
>> >>> >> >
>> >>> >> > It's a list of objects created after yt operates on one
>> >>> >> > dataset and after the garbage collector is explicitly called.
>> >>> >> > Each iteration of the loop sees the creation of objects
>> >>> >> > representing the FLASH grids, hierarchy, and associated
>> >>> >> > metadata.  With enough iterations, this overhead from previous
>> >>> >> > loop iterations begins to dominate the total memory budget.
>> >>> >>
>> >>> >> The code snippet I sent might help reduce it, but I think it
>> >>> >> speaks to a deeper problem in that somehow the FLASH stuff
>> >>> >> isn't being GC'd anywhere.  It really ought to be.
>> >>> >>
>> >>> >> Can you try also doing:
>> >>> >>
>> >>> >> yt.frontends.flash.FLASHDataset._skip_cache = True
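>> >>> >>
>> >>> >> i.e., set it once before the load loop -- a sketch, with the
>> >>> >> filename list as a placeholder:
>> >>> >>
>> >>> >> import yt
>> >>> >>
>> >>> >> yt.frontends.flash.FLASHDataset._skip_cache = True
>> >>> >> filenames = ["output_%04d" % i for i in range(300)]  # placeholders
>> >>> >> for fn in filenames:
>> >>> >>     pf = yt.load(fn)
>> >>> >>     pf.index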
>> >>> >
>> >>> >
>> >>> > No effect, unfortunately.
>> >>> >
>> >>> >>
>> >>> >> and seeing if that helps?
>> >>> >>
>> >>> >> >
>> >>> >> >>
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> -Matt
>> >>> >> >>
>> >>> >> >> >
>> >>> >> >> > Thanks for your help in debugging this!
>> >>> >> >> >
>> >>> >> >> > -Nathan
>> >>> >> >> >