[yt-dev] Zombie jobs on eudora?

Nathan Goldbaum nathan12343 at gmail.com
Wed Jun 11 11:58:56 PDT 2014


Could this issue be related?

https://github.com/numpy/numpy/issues/1601

Can you elaborate a bit more on why we're using an object array in the
first place?  If switching to a list solves these issues, perhaps that is
the way to go.
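
For reference, here is a minimal sketch of the refcount bookkeeping
involved when grids sit in an object array versus a plain list (GridStub
is just a stand-in here, not a real yt class):

import sys
import numpy as np

class GridStub(object):
    # Stand-in for a grid object held by the index.
    pass

grids = [GridStub() for _ in range(4)]
# 2: one reference from the list, plus getrefcount's own temporary argument.
print(sys.getrefcount(grids[0]))

# Packing the grids into an object array adds one more reference per grid,
# held by the array's buffer.
grid_arr = np.array(grids, dtype=object)
print(sys.getrefcount(grids[0]))  # 3

# Dropping the array should bring the count back down; the leak suggests
# that decref isn't happening when only the index's object array holds
# the grids.
del grid_arr
print(sys.getrefcount(grids[0]))  # back to 2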


On Wed, Jun 11, 2014 at 7:25 AM, Matthew Turk <matthewturk at gmail.com> wrote:

> I should also note that at some point in the future I want to get rid of
> the object arrays for grids, but that is on a longer timescale.  Using
> John's grid tree is a much better approach.
>
> On Wed, Jun 11, 2014 at 9:21 AM, Matthew Turk <matthewturk at gmail.com>
> wrote:
> > On Tue, Jun 10, 2014 at 10:53 PM, Matthew Turk <matthewturk at gmail.com>
> wrote:
> >> On Tue, Jun 10, 2014 at 10:50 PM, Nathan Goldbaum <
> nathan12343 at gmail.com> wrote:
> >>> For the leaking YTArrays, Kacper suggested the following patch on IRC:
> >>>
> >>> http://bpaste.net/show/361120/
> >>>
> >>> This works for FLASH but seems to break field detection for enzo.
> >>
> >> I don't think this will ever be a big memory hog, but it is worth
> >> fixing.
> >>
> >
> > I've spent a bit of time on this again this morning, and
> > everything seems to come back down to the issue of having a numpy
> > array of grid objects.  If I switch this to a list, the reference
> > counting is correct again and things get deallocated properly.  I've
> > tried a number of ways of changing how they're allocated, but none
> > seem to work for getting the refcount correct.  Oddly enough, if I
> > track both a list *and* an array (i.e., set self._grids =
> > self.grids.tolist()) then the refcounting is correct.
> >
> > I'm sure there's an explanation for this, but I don't know it.  It
> > looks to me like numpy thinks it owns the data and that it should
> > decrement the object refcount.
> >
> > By adding this line:
> >
> > self._grids = self.grids.tolist()
> >
> > after the call to _populate_grid_objects() in grid_geometry_handler, I
> > was able to get all references tracked and removed.
> >
> >>>
> >>>
> >>> On Tue, Jun 10, 2014 at 8:47 PM, Matthew Turk <matthewturk at gmail.com>
> wrote:
> >>>>
> >>>> Hi Nathan,
> >>>>
> >>>> I believe there are two things at work here.
> >>>>
> >>>> 1) (I do not have high confidence in this one.)  YTArrays that are
> >>>> referenced with a .d and turned into numpy arrays that no longer *own
> >>>> the data* may be retaining a reference that doesn't get freed later.
> >>>> This happens often when we are doing things in the
> >>>> hierarchy instantiation phase.  I haven't been able to figure out
> >>>> which references get lost; for me, over 40 outputs, I lost 1560.  I
> >>>> think it's 39 YTArrays per hierarchy.  This might also be related to
> >>>> field detection.  I think this is not a substantial contributor.
> >>>> 2) For some reason, when the .grids attribute (object array) is
> >>>> deleted on an index, the refcounts of those grids don't decrease.  I
> >>>> am able to decrease their refcounts by manually setting
> >>>> pf.index.grids[:] = None.  This eliminated all retained grid
> >>>> references.
> >>>>
> >>>> So, I think the root is that at some point, because of circular
> >>>> references or whatever, the finalizer isn't being called on the
> >>>> GridIndex (or on Index itself).  This results in the reference to the
> >>>> grids array being kept, which then pumps up the lost object count.  I
> >>>> don't know why it's not getting called (it's not guaranteed to be
> >>>> called, in any event).
> >>>>
> >>>> I have to take care of some other things (including Brendan's note
> >>>> about the memory problems with particle datasets) but I am pretty sure
> >>>> this is the root.
> >>>>
> >>>> -Matt
> >>>>
> >>>> On Tue, Jun 10, 2014 at 10:13 PM, Matthew Turk <matthewturk at gmail.com
> >
> >>>> wrote:
> >>>> > Hi Nathan,
> >>>> >
> >>>> > All it requires is a call to .index; you don't need to do anything
> >>>> > else to get it to lose references.
> >>>> >
> >>>> > I'm still looking into it.
> >>>> >
> >>>> > -Matt
> >>>> >
> >>>> > On Tue, Jun 10, 2014 at 9:26 PM, Nathan Goldbaum <
> nathan12343 at gmail.com>
> >>>> > wrote:
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >> On Tue, Jun 10, 2014 at 10:59 AM, Matthew Turk <
> matthewturk at gmail.com>
> >>>> >> wrote:
> >>>> >>>
> >>>> >>> Do you have a reproducible script?
> >>>> >>
> >>>> >>
> >>>> >> This should do the trick: http://paste.yt-project.org/show/4767/
> >>>> >>
> >>>> >> (this is with an enzo dataset by the way)
> >>>> >>
> >>>> >> That script prints (on my machine):
> >>>> >>
> >>>> >> EnzoGrid    15065
> >>>> >> YTArray     1520
> >>>> >> list        704
> >>>> >> dict        2
> >>>> >> MaskedArray 1
> >>>> >>
> >>>> >> Which indicates that 15000 EnzoGrid objects and 1520 YTArray
> >>>> >> objects have leaked.
> >>>> >>
> >>>> >> The list I'm printing out at the end of the script should be the
> >>>> >> objects that leaked during the loop over the Enzo dataset.  The
> >>>> >> objgraph.get_leaking_objects() function returns the list of all
> >>>> >> objects being tracked by the garbage collector that have no
> >>>> >> referrers but still have nonzero refcounts.
> >>>> >>
> >>>> >> This means the "original_leaks" list isn't necessarily a list of
> >>>> >> leaky objects - most of the things in there are singletons that
> >>>> >> the interpreter keeps around.  To create a list of leaky objects
> >>>> >> produced by iterating over the loop, I take the set difference of
> >>>> >> the output of get_leaking_objects() before and after iterating
> >>>> >> over the dataset.
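> >>>> >>
> >>>> >> In outline, that before/after comparison looks something like the
> >>>> >> sketch below (the dataset path and the before_ids/new_leaks names
> >>>> >> are placeholders; the paste above has the real script):
> >>>> >>
> >>>> >> import gc
> >>>> >> from collections import Counter
> >>>> >>
> >>>> >> import objgraph
> >>>> >> import yt
> >>>> >>
> >>>> >> # Leaking objects that exist before we touch any dataset; mostly
> >>>> >> # interpreter singletons.
> >>>> >> gc.collect()
> >>>> >> original_leaks = objgraph.get_leaking_objects()
> >>>> >> before_ids = set(id(o) for o in original_leaks)
> >>>> >>
> >>>> >> ds = yt.load("DD0040/DD0040")  # placeholder dataset path
> >>>> >> ds.index                       # trigger hierarchy construction
> >>>> >> del ds
> >>>> >>
> >>>> >> # Anything leaking now that wasn't leaking before is attributable
> >>>> >> # to the load/index cycle above.
> >>>> >> gc.collect()
> >>>> >> new_leaks = [o for o in objgraph.get_leaking_objects()
> >>>> >>              if id(o) not in before_ids]
> >>>> >> print(Counter(type(o).__name__ for o in new_leaks))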
> >>>> >>
> >>>> >>>
> >>>> >>> If you make a bunch of symlinks to one flash file and load them
> >>>> >>> all in sequence, does that replicate the behavior?
> >>>> >>
> >>>> >>
> >>>> >> Yes, it seems to.  Compare the output of this script:
> >>>> >> http://paste.yt-project.org/show/4768/
> >>>> >>
> >>>> >> Adjust the range of the for loop from 0 to 5, creating symlinks to
> >>>> >> WindTunnel/windtunnel_4lev_hdf5_plt_cnt_0040 as needed.
> >>>> >>
> >>>> >>>
> >>>> >>>
> >>>> >>> On Tue, Jun 10, 2014 at 12:57 PM, Nathan Goldbaum
> >>>> >>> <nathan12343 at gmail.com>
> >>>> >>> wrote:
> >>>> >>> >
> >>>> >>> >
> >>>> >>> >
> >>>> >>> > On Tue, Jun 10, 2014 at 10:45 AM, Matthew Turk
> >>>> >>> > <matthewturk at gmail.com>
> >>>> >>> > wrote:
> >>>> >>> >>
> >>>> >>> >> Hi Nathan,
> >>>> >>> >>
> >>>> >>> >> On Tue, Jun 10, 2014 at 12:43 PM, Nathan Goldbaum
> >>>> >>> >> <nathan12343 at gmail.com>
> >>>> >>> >> wrote:
> >>>> >>> >> >
> >>>> >>> >> >
> >>>> >>> >> >
> >>>> >>> >> > On Tue, Jun 10, 2014 at 6:09 AM, Matthew Turk
> >>>> >>> >> > <matthewturk at gmail.com>
> >>>> >>> >> > wrote:
> >>>> >>> >> >>
> >>>> >>> >> >> Hi Nathan,
> >>>> >>> >> >>
> >>>> >>> >> >> On Mon, Jun 9, 2014 at 11:02 PM, Nathan Goldbaum
> >>>> >>> >> >> <nathan12343 at gmail.com>
> >>>> >>> >> >> wrote:
> >>>> >>> >> >> > Hey all,
> >>>> >>> >> >> >
> >>>> >>> >> >> > I'm looking at a memory leak that Philip (cc'd) is
> >>>> >>> >> >> > seeing when iterating over a long list of FLASH datasets.
> >>>> >>> >> >> > Just as an example of the type of behavior he is seeing -
> >>>> >>> >> >> > today he left his script running and ended up consuming
> >>>> >>> >> >> > 300 GB of RAM on a viz node.
> >>>> >>> >> >> >
> >>>> >>> >> >> > FWIW, the dataset is not particularly large - ~300 outputs
> >>>> >>> >> >> > and ~100 MB per output. These are also FLASH cylindrical
> >>>> >>> >> >> > coordinate simulations - so perhaps this behavior will
> >>>> >>> >> >> > only occur in curvilinear geometries?
> >>>> >>> >> >>
> >>>> >>> >> >> Hm, I don't know about that.
> >>>> >>> >> >>
> >>>> >>> >> >> >
> >>>> >>> >> >> > I've been playing with objgraph to try to understand
> >>>> >>> >> >> > what's happening.
> >>>> >>> >> >> > Here's the script I've been using:
> >>>> >>> >> >> > http://paste.yt-project.org/show/4762/
> >>>> >>> >> >> >
> >>>> >>> >> >> > Here's the output after one iteration of the for loop:
> >>>> >>> >> >> > http://paste.yt-project.org/show/4761/
> >>>> >>> >> >> >
> >>>> >>> >> >> > It seems that for some reason a lot of data is not being
> >>>> >>> >> >> > garbage collected.
> >>>> >>> >> >> >
> >>>> >>> >> >> > Could there be a reference counting bug somewhere down in
> >>>> >>> >> >> > a cython routine?
> >>>> >>> >> >>
> >>>> >>> >> >> Based on what you're running, the only Cython routines being
> >>>> >>> >> >> called are likely in the selection system.
> >>>> >>> >> >>
> >>>> >>> >> >> > Objgraph is unable to find backreferences to root grid
> >>>> >>> >> >> > tiles in the flash dataset, and all the other yt objects
> >>>> >>> >> >> > that I've looked at seem to have backreference graphs that
> >>>> >>> >> >> > terminate at a FLASHGrid object that represents a root
> >>>> >>> >> >> > grid tile in one of the datasets.  That's the best guess I
> >>>> >>> >> >> > have - but definitely nothing conclusive.  I'd appreciate
> >>>> >>> >> >> > any other ideas anyone else has to help debug this.
> >>>> >>> >> >>
> >>>> >>> >> >> I'm not entirely sure how to parse the output you've pasted,
> >>>> >>> >> >> but I do have a thought.  If you have a reproducible case, I
> >>>> >>> >> >> can test it myself.  I am wondering if this could be related
> >>>> >>> >> >> to the way that grid masks are cached.  You should be able
> >>>> >>> >> >> to test this by adding this line to _get_selector_mask in
> >>>> >>> >> >> grid_patch.py, just before "return mask":
> >>>> >>> >> >>
> >>>> >>> >> >> self._last_mask = self._last_selector_id = None
> >>>> >>> >> >>
> >>>> >>> >> >> Something like this patch:
> >>>> >>> >> >>
> >>>> >>> >> >> http://paste.yt-project.org/show/4316/
> >>>> >>> >> >
> >>>> >>> >> >
> >>>> >>> >> > Thanks for the code!  I will look into this today.
> >>>> >>> >> >
> >>>> >>> >> > Sorry for not explaining the random terminal output I pasted
> >>>> >>> >> > from objgraph :/
> >>>> >>> >> >
> >>>> >>> >> > It's a list of objects created after yt operates on one
> >>>> >>> >> > dataset and after the garbage collector is explicitly called.
> >>>> >>> >> > Each iteration of the loop sees the creation of objects
> >>>> >>> >> > representing the FLASH grids, hierarchy, and associated
> >>>> >>> >> > metadata.  With enough iterations this overhead from previous
> >>>> >>> >> > loop iterations begins to dominate the total memory budget.
> >>>> >>> >>
> >>>> >>> >> The code snippet I sent might help reduce it, but I think it
> >>>> >>> >> speaks to a deeper problem in that somehow the FLASH stuff
> >>>> >>> >> isn't being GC'd anywhere.  It really ought to be.
> >>>> >>> >>
> >>>> >>> >> Can you try also doing:
> >>>> >>> >>
> >>>> >>> >> yt.frontends.flash.FLASHDataset._skip_cache = True
> >>>> >>> >
> >>>> >>> >
> >>>> >>> > No effect, unfortunately.
> >>>> >>> >
> >>>> >>> >>
> >>>> >>> >> and seeing if that helps?
> >>>> >>> >>
> >>>> >>> >> >
> >>>> >>> >> >>
> >>>> >>> >> >>
> >>>> >>> >> >>
> >>>> >>> >> >> -Matt
> >>>> >>> >> >>
> >>>> >>> >> >> >
> >>>> >>> >> >> > Thanks for your help in debugging this!
> >>>> >>> >> >> >
> >>>> >>> >> >> > -Nathan
> >>>> >>> >> >> >