[yt-dev] Zombie jobs on eudora?

Nathan Goldbaum nathan12343 at gmail.com
Wed Jun 11 12:26:55 PDT 2014


I can confirm that creating the list alias to .grids does keep the Grid
objects from showing up in the objgraph output.
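
A check of this kind can also be repeated without objgraph. Below is a rough, stdlib-only sketch of a per-type object count taken before and after a piece of work. The helper names (type_counts, growth) and the Grid stand-in class are hypothetical, made up for illustration; they are not yt or objgraph API. Note that gc.get_objects() only returns objects the garbage collector tracks.

```python
import gc
from collections import Counter


def type_counts():
    """Count gc-tracked objects by type name."""
    gc.collect()  # clear collectable cycles first so they are not counted
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def growth(before, after):
    """Return {type name: increase} for types whose count went up."""
    return {name: after[name] - before[name]
            for name in after if after[name] > before[name]}


class Grid(object):
    """Hypothetical stand-in for a grid object; not the yt class."""


def demo():
    before = type_counts()
    held = [Grid() for _ in range(5)]  # deliberately kept alive
    after = type_counts()
    # Report how many extra Grid instances are alive after the "work".
    print(growth(before, after).get("Grid", 0))
    del held


if __name__ == "__main__":
    demo()
```

In a real test the body of demo() would load a dataset and touch .index, and the comparison would be made after the dataset has gone out of scope.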

That said, I'm still seeing steadily increasing memory usage when I iterate
over a bunch of datasets (http://paste.yt-project.org/show/4773/), creating
SlicePlots for each one.  I'm not sure yet where the memory is going, just
that objgraph can't see it.
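
Since objgraph only reports on Python objects, growth like this may be easier to follow at the process level. The following is a minimal sketch, assuming a Unix system where the stdlib resource module is available. The names peak_rss_mb, track_memory and the work callback are hypothetical; in practice the callback would load one dataset and create its SlicePlot. ru_maxrss is a high-water mark, so the recorded values can show growth but never a decrease.

```python
import gc
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of this process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024.0 * 1024.0 if sys.platform == "darwin" else 1024.0
    return peak / divisor


def track_memory(items, work):
    """Call work(item) for each item and record peak RSS after each call."""
    history = []
    for item in items:
        work(item)
        gc.collect()  # rule out garbage that is merely waiting for collection
        history.append(peak_rss_mb())
    return history


if __name__ == "__main__":
    # Placeholder workload; replace with the per-dataset plotting code.
    for step, mb in enumerate(track_memory(range(5), lambda i: [0] * 100000)):
        print("iteration %d: peak RSS %.1f MiB" % (step, mb))
```

A peak that keeps climbing by a similar amount per iteration, while the per-type object counts stay flat, would point at memory held outside gc-tracked Python objects.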


On Wed, Jun 11, 2014 at 12:02 PM, Matthew Turk <matthewturk at gmail.com>
wrote:

> On Wed, Jun 11, 2014 at 1:58 PM, Nathan Goldbaum <nathan12343 at gmail.com>
> wrote:
> > Could this issue be related?
> >
> > https://github.com/numpy/numpy/issues/1601
>
> Yeah, that's the one.
>
> >
> > Can you elaborate a bit more about why we're using an object array in the
> > first place?  If switching to using a list solves these issues perhaps
> > that is the way to go.
>
> Two reasons.  One is that it's an order of magnitude faster for some
> things that we do a lot, and the other is that it makes it much easier
> to index.  We can do things like selection based on indices or booleans
> this way.  But we don't do that very often anymore.
>
> I don't want to switch it to a list; that's a nasty bandaid that
> breaks things.  We can just add on an additional list, which is a
> nasty bandaid that doesn't break things.  I think the memory overhead
> will be minimal for that.  Really fixing it will require moving away
> from arrays completely, which we can slot in for 3.1.
>
> >
> >
> > On Wed, Jun 11, 2014 at 7:25 AM, Matthew Turk <matthewturk at gmail.com>
> > wrote:
> >>
> >> I should also note, at some point in the future I want to get rid of
> >> the object arrays for grids, but that timescale is longer.  Using
> >> John's grid tree is a much better approach.
> >>
> >> On Wed, Jun 11, 2014 at 9:21 AM, Matthew Turk <matthewturk at gmail.com>
> >> wrote:
> >> > On Tue, Jun 10, 2014 at 10:53 PM, Matthew Turk <matthewturk at gmail.com>
> >> > wrote:
> >> >> On Tue, Jun 10, 2014 at 10:50 PM, Nathan Goldbaum
> >> >> <nathan12343 at gmail.com> wrote:
> >> >>> For the leaking YTArrays, Kacper suggested the following patch on
> >> >>> IRC:
> >> >>>
> >> >>> http://bpaste.net/show/361120/
> >> >>>
> >> >>> This works for FLASH but seems to break field detection for enzo.
> >> >>
> >> >> I don't think this will ever be a big memory hog, but it is worth
> >> >> fixing.
> >> >>
> >> >
> >> > I've spent a small bit of time at this again this morning, and
> >> > everything seems to come back down to the issue of having a numpy
> >> > array of grid objects.  If I switch this to a list, the reference
> >> > counting is correct again and things get deallocated properly.  I've
> >> > tried a number of ways of changing how they're allocated, but none
> >> > seem to work for getting the refcount correct.  Oddly enough, if I
> >> > track both a list *and* an array (i.e., set self._grids =
> >> > self.grids.tolist()) then the refcounting is correct.
> >> >
> >> > I'm sure there's an explanation for this, but I don't know it.  It
> >> > looks to me like numpy thinks it owns the data and that it should
> >> > decrement the object refcount.
> >> >
> >> > By adding this line:
> >> >
> >> > self._grids = self.grids.tolist()
> >> >
> >> > after the call to _populate_grid_objects() in grid_geometry_handler, I
> >> > was able to get all references tracked and removed.
> >> >
> >> >>>
> >> >>>
> >> >>> On Tue, Jun 10, 2014 at 8:47 PM, Matthew Turk <matthewturk at gmail.com>
> >> >>> wrote:
> >> >>>>
> >> >>>> Hi Nathan,
> >> >>>>
> >> >>>> I believe there are two things at work here.
> >> >>>>
> >> >>>> 1) (I do not have high confidence of this one.)  YTArrays that are
> >> >>>> referenced with a .d and turned into numpy arrays which no longer
> >> >>>> *own the data* may be retaining a reference, but that reference
> >> >>>> doesn't get freed later.  This happens often when we are doing
> >> >>>> things in the hierarchy instantiation phase.  I haven't been able
> >> >>>> to figure out which references get lost; for me, over 40 outputs, I
> >> >>>> lost 1560.  I think it's 39 YTArrays per hierarchy.  This might
> >> >>>> also be related to field detection.  I think this is not a
> >> >>>> substantial contributor.
> >> >>>> 2) For some reason, when the .grids attribute (object array) is
> >> >>>> deleted on an index, the refcounts of those grids don't decrease.
> >> >>>> I am able to decrease their refcounts by manually setting
> >> >>>> pf.index.grids[:] = None.  This eliminated all retained grid
> >> >>>> references.
> >> >>>>
> >> >>>> So, I think the root is that at some point, because of circular
> >> >>>> references or whatever, the finalizer isn't being called on the
> >> >>>> GridIndex (or on Index itself).  This results in the reference to
> >> >>>> the grids array being kept, which then pumps up the lost object
> >> >>>> count.  I don't know why it's not getting called (it's not
> >> >>>> guaranteed to be called, in any event).
> >> >>>>
> >> >>>> I have to take care of some other things (including Brendan's note
> >> >>>> about the memory problems with particle datasets) but I am pretty
> >> >>>> sure this is the root.
> >> >>>>
> >> >>>> -Matt
> >> >>>>
> >> >>>> On Tue, Jun 10, 2014 at 10:13 PM, Matthew Turk
> >> >>>> <matthewturk at gmail.com>
> >> >>>> wrote:
> >> >>>> > Hi Nathan,
> >> >>>> >
> >> >>>> > All it requires is a call to .index; you don't need to do anything
> >> >>>> > else to get it to lose references.
> >> >>>> >
> >> >>>> > I'm still looking into it.
> >> >>>> >
> >> >>>> > -Matt
> >> >>>> >
> >> >>>> > On Tue, Jun 10, 2014 at 9:26 PM, Nathan Goldbaum
> >> >>>> > <nathan12343 at gmail.com>
> >> >>>> > wrote:
> >> >>>> >>
> >> >>>> >>
> >> >>>> >>
> >> >>>> >> On Tue, Jun 10, 2014 at 10:59 AM, Matthew Turk
> >> >>>> >> <matthewturk at gmail.com>
> >> >>>> >> wrote:
> >> >>>> >>>
> >> >>>> >>> Do you have a reproducible script?
> >> >>>> >>
> >> >>>> >>
> >> >>>> >> This should do the trick: http://paste.yt-project.org/show/4767/
> >> >>>> >>
> >> >>>> >> (this is with an enzo dataset by the way)
> >> >>>> >>
> >> >>>> >> That script prints (on my machine):
> >> >>>> >>
> >> >>>> >> EnzoGrid    15065
> >> >>>> >> YTArray     1520
> >> >>>> >> list        704
> >> >>>> >> dict        2
> >> >>>> >> MaskedArray 1
> >> >>>> >>
> >> >>>> >> Which indicates that 15065 EnzoGrid objects and 1520 YTArray
> >> >>>> >> objects have leaked.
> >> >>>> >>
> >> >>>> >> The list I'm printing out at the end of the script should be the
> >> >>>> >> objects that leaked during the loop over the Enzo dataset.  The
> >> >>>> >> objgraph.get_leaking_objects() function returns the list of all
> >> >>>> >> objects being tracked by the garbage collector that have no
> >> >>>> >> references but still have nonzero refcounts.
> >> >>>> >>
> >> >>>> >> This means the "original_leaks" list isn't necessarily a list of
> >> >>>> >> leaky objects - most of the things in there are singletons that
> >> >>>> >> the interpreter keeps around. To create a list of leaky objects
> >> >>>> >> produced by iterating over the loop I take the set difference of
> >> >>>> >> the output of get_leaking_objects() before and after iterating
> >> >>>> >> over the dataset.
> >> >>>> >>
> >> >>>> >>>
> >> >>>> >>> If you make a bunch of symlinks to one flash file and load them
> >> >>>> >>> all in sequence, does that replicate the behavior?
> >> >>>> >>
> >> >>>> >>
> >> >>>> >> Yes, it seems to.  Compare the output of this script:
> >> >>>> >> http://paste.yt-project.org/show/4768/
> >> >>>> >>
> >> >>>> >> Adjust the range of the for loop from 0 to 5 - creating the
> >> >>>> >> needed symlinks to WindTunnel/windtunnel_4lev_hdf5_plt_cnt_0040
> >> >>>> >> as needed.
> >> >>>> >>
> >> >>>> >>>
> >> >>>> >>>
> >> >>>> >>> On Tue, Jun 10, 2014 at 12:57 PM, Nathan Goldbaum
> >> >>>> >>> <nathan12343 at gmail.com>
> >> >>>> >>> wrote:
> >> >>>> >>> >
> >> >>>> >>> >
> >> >>>> >>> >
> >> >>>> >>> > On Tue, Jun 10, 2014 at 10:45 AM, Matthew Turk
> >> >>>> >>> > <matthewturk at gmail.com>
> >> >>>> >>> > wrote:
> >> >>>> >>> >>
> >> >>>> >>> >> Hi Nathan,
> >> >>>> >>> >>
> >> >>>> >>> >> On Tue, Jun 10, 2014 at 12:43 PM, Nathan Goldbaum
> >> >>>> >>> >> <nathan12343 at gmail.com>
> >> >>>> >>> >> wrote:
> >> >>>> >>> >> >
> >> >>>> >>> >> >
> >> >>>> >>> >> >
> >> >>>> >>> >> > On Tue, Jun 10, 2014 at 6:09 AM, Matthew Turk
> >> >>>> >>> >> > <matthewturk at gmail.com>
> >> >>>> >>> >> > wrote:
> >> >>>> >>> >> >>
> >> >>>> >>> >> >> Hi Nathan,
> >> >>>> >>> >> >>
> >> >>>> >>> >> >> On Mon, Jun 9, 2014 at 11:02 PM, Nathan Goldbaum
> >> >>>> >>> >> >> <nathan12343 at gmail.com>
> >> >>>> >>> >> >> wrote:
> >> >>>> >>> >> >> > Hey all,
> >> >>>> >>> >> >> >
> >> >>>> >>> >> >> > I'm looking at a memory leak that Philip (cc'd) is seeing
> >> >>>> >>> >> >> > when iterating over a long list of FLASH datasets.  Just as
> >> >>>> >>> >> >> > an example of the type of behavior he is seeing - today he
> >> >>>> >>> >> >> > left his script running and ended up consuming 300 GB of
> >> >>>> >>> >> >> > RAM on a viz node.
> >> >>>> >>> >> >> >
> >> >>>> >>> >> >> > FWIW, the dataset is not particularly large - ~300 outputs
> >> >>>> >>> >> >> > and ~100 MB per output. These are also FLASH cylindrical
> >> >>>> >>> >> >> > coordinate simulations - so perhaps this behavior will only
> >> >>>> >>> >> >> > occur in curvilinear geometries?
> >> >>>> >>> >> >> Hm, I don't know about that.
> >> >>>> >>> >> >>
> >> >>>> >>> >> >> >
> >> >>>> >>> >> >> > I've been playing with objgraph to try to understand what's
> >> >>>> >>> >> >> > happening.  Here's the script I've been using:
> >> >>>> >>> >> >> > http://paste.yt-project.org/show/4762/
> >> >>>> >>> >> >> >
> >> >>>> >>> >> >> > Here's the output after one iteration of the for loop:
> >> >>>> >>> >> >> > http://paste.yt-project.org/show/4761/
> >> >>>> >>> >> >> >
> >> >>>> >>> >> >> > It seems that for some reason a lot of data is not being
> >> >>>> >>> >> >> > garbage collected.
> >> >>>> >>> >> >> >
> >> >>>> >>> >> >> > Could there be a reference counting bug somewhere down in a
> >> >>>> >>> >> >> > cython routine?
> >> >>>> >>> >> >>
> >> >>>> >>> >> >> Based on what you're running, the only Cython routines being
> >> >>>> >>> >> >> called are likely in the selection system.
> >> >>>> >>> >> >>
> >> >>>> >>> >> >> > Objgraph is unable to find backreferences to root grid
> >> >>>> >>> >> >> > tiles in the flash dataset, and all the other yt objects
> >> >>>> >>> >> >> > that I've looked at seem to have backreference graphs that
> >> >>>> >>> >> >> > terminate at a FLASHGrid object that represents a root grid
> >> >>>> >>> >> >> > tile in one of the datasets.  That's the best guess I have
> >> >>>> >>> >> >> > - but definitely nothing conclusive.  I'd appreciate any
> >> >>>> >>> >> >> > other ideas anyone else has to help debug this.
> >> >>>> >>> >> >>
> >> >>>> >>> >> >> I'm not entirely sure how to parse the output you've pasted,
> >> >>>> >>> >> >> but I do have a thought.  If you have a reproducible case, I
> >> >>>> >>> >> >> can test it myself.  I am wondering if this could be related
> >> >>>> >>> >> >> to the way that grid masks are cached.  You should be able to
> >> >>>> >>> >> >> test this by adding this line to _get_selector_mask in
> >> >>>> >>> >> >> grid_patch.py, just before "return mask"
> >> >>>> >>> >> >>
> >> >>>> >>> >> >> self._last_mask = self._last_selector_id = None
> >> >>>> >>> >> >>
> >> >>>> >>> >> >> Something like this patch:
> >> >>>> >>> >> >>
> >> >>>> >>> >> >> http://paste.yt-project.org/show/4316/
> >> >>>> >>> >> >
> >> >>>> >>> >> >
> >> >>>> >>> >> > Thanks for the code!  I will look into this today.
> >> >>>> >>> >> >
> >> >>>> >>> >> > Sorry for not explaining the random terminal output I pasted
> >> >>>> >>> >> > from objgraph :/
> >> >>>> >>> >> >
> >> >>>> >>> >> > It's a list of objects created after yt operates on one
> >> >>>> >>> >> > dataset and after the garbage collector is explicitly called.
> >> >>>> >>> >> > Each iteration of the loop sees the creation of objects
> >> >>>> >>> >> > representing the FLASH grids, hierarchy, and associated
> >> >>>> >>> >> > metadata.  With enough iterations this overhead from previous
> >> >>>> >>> >> > loop iterations begins to dominate the total memory budget.
> >> >>>> >>> >>
> >> >>>> >>> >> The code snippet I sent might help reduce it, but I think it
> >> >>>> >>> >> speaks to a deeper problem in that somehow the FLASH stuff
> >> >>>> >>> >> isn't being GC'd anywhere.  It really ought to be.
> >> >>>> >>> >>
> >> >>>> >>> >> Can you try also doing:
> >> >>>> >>> >>
> >> >>>> >>> >> yt.frontends.flash.FLASHDataset._skip_cache = True
> >> >>>> >>> >
> >> >>>> >>> >
> >> >>>> >>> > No effect, unfortunately.
> >> >>>> >>> >
> >> >>>> >>> >>
> >> >>>> >>> >> and seeing if that helps?
> >> >>>> >>> >>
> >> >>>> >>> >> >
> >> >>>> >>> >> >>
> >> >>>> >>> >> >>
> >> >>>> >>> >> >>
> >> >>>> >>> >> >> -Matt
> >> >>>> >>> >> >>
> >> >>>> >>> >> >> >
> >> >>>> >>> >> >> > Thanks for your help in debugging this!
> >> >>>> >>> >> >> >
> >> >>>> >>> >> >> > -Nathan
> >> >>>> >>> >> >> >
> >> >>>> >>> >> >> _______________________________________________
> >> >>>> >>> >> >> yt-dev mailing list
> >> >>>> >>> >> >> yt-dev at lists.spacepope.org
> >> >>>> >>> >> >> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org