[yt-dev] Zombie jobs on eudora?

Nathan Goldbaum nathan12343 at gmail.com
Wed Jun 11 12:43:24 PDT 2014


On Wed, Jun 11, 2014 at 12:42 PM, Kacper Kowalik <xarthisius.kk at gmail.com>
wrote:

> On 11.06.2014 21:26, Nathan Goldbaum wrote:
> > I can confirm that creating the list alias to .grids does eliminate the
> > Grid objects from showing up in the objgraph output.
> >
> > That said, I'm still seeing steadily increasing memory usage when I iterate
> > over a bunch of datasets (http://paste.yt-project.org/show/4773/), creating
> > SlicePlots for each one.  I'm not sure yet where the memory is going, just
> > that objgraph can't see it.
>
> Hi Nathan,
> if you're iterating over big files, make sure it's not the buffer/cache.
> Cheers,
> Kacper
>

My ignorance about these things is shining through - how would I go about
doing that?
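
One rough way to tell the two cases apart, sketched under assumptions: the
kernel's buffer/cache belongs to the OS rather than to the Python process, so
if the process's own peak resident set size stays flat while free or top
reports less free memory, the growth is probably file cache and not a leak.
The helper name peak_rss_mib and the bytes() stand-in for loading a dataset
are made up for illustration, and the standard-library resource module is
only available on Unix-like systems.

```python
import resource
import sys


def peak_rss_mib():
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


readings = [peak_rss_mib()]
held = []  # deliberately keeps allocations alive, to mimic a leak
for i in range(3):
    # Stand-in for "load one dataset and make a SlicePlot".
    held.append(bytes(10 * 1024 * 1024))
    readings.append(peak_rss_mib())

for i, value in enumerate(readings):
    print(f"after {i} iterations: peak RSS {value:.1f} MiB")
```

In a real run the loop body would be the dataset load and plot; a peak RSS
that keeps climbing from one iteration to the next points at the process
itself rather than at the cache.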


>
> >
> >
> > On Wed, Jun 11, 2014 at 12:02 PM, Matthew Turk <matthewturk at gmail.com>
> > wrote:
> >
> >> On Wed, Jun 11, 2014 at 1:58 PM, Nathan Goldbaum <nathan12343 at gmail.com>
> >> wrote:
> >>> Could this issue be related?
> >>>
> >>> https://github.com/numpy/numpy/issues/1601
> >>
> >> Yeah, that's the one.
> >>
> >>>
> >>> Can you elaborate a bit more about why we're using an object array in the
> >>> first place?  If switching to using a list solves these issues perhaps that
> >>> is the way to go.
> >>
> >> Two reasons.  One is that it's an order of magnitude faster for some
> >> things that we do a lot, and the other is that it makes it much easier
> >> to index.  We can do things like selection based on indices or booleans
> >> this way.  But we don't do that very often anymore.
> >>
> >> I don't want to switch it to a list; that's a nasty bandaid that
> >> breaks things.  We can just add on an additional list, which is a
> >> nasty bandaid that doesn't break things.  I think the memory overhead
> >> will be minimal for that.  Really fixing it will require moving away
> >> from arrays completely, which we can slot in for 3.1.
> >>
> >>>
> >>>
> >>> On Wed, Jun 11, 2014 at 7:25 AM, Matthew Turk <matthewturk at gmail.com>
> >>> wrote:
> >>>>
> >>>> I should also note, at some point in the future I want to get rid of
> >>>> the object arrays for grids, but that timescale is longer.  Using
> >>>> John's grid tree is a much better approach.
> >>>>
> >>>> On Wed, Jun 11, 2014 at 9:21 AM, Matthew Turk <matthewturk at gmail.com>
> >>>> wrote:
> >>>>> On Tue, Jun 10, 2014 at 10:53 PM, Matthew Turk <matthewturk at gmail.com>
> >>>>> wrote:
> >>>>>> On Tue, Jun 10, 2014 at 10:50 PM, Nathan Goldbaum
> >>>>>> <nathan12343 at gmail.com> wrote:
> >>>>>>> For the leaking YTArrays, Kacper suggested the following patch on IRC:
> >>>>>>>
> >>>>>>> http://bpaste.net/show/361120/
> >>>>>>>
> >>>>>>> This works for FLASH but seems to break field detection for enzo.
> >>>>>>
> >>>>>> I don't think this will ever be a big memory hog, but it is worth
> >>>>>> fixing.
> >>>>>>
> >>>>>
> >>>>> I've spent a small bit of time at this again this morning, and
> >>>>> everything seems to come back down to the issue of having a numpy
> >>>>> array of grid objects.  If I switch this to a list, the reference
> >>>>> counting is correct again and things get deallocated properly.  I've
> >>>>> tried a number of ways of changing how they're allocated, but none
> >>>>> seem to work for getting the refcount correct.  Oddly enough, if I
> >>>>> track both a list *and* an array (i.e., set self._grids =
> >>>>> self.grids.tolist()) then the refcounting is correct.
> >>>>>
> >>>>> I'm sure there's an explanation for this, but I don't know it.  It
> >>>>> looks to me like numpy thinks it owns the data and that it should
> >>>>> decrement the object refcount.
> >>>>>
> >>>>> By adding this line:
> >>>>>
> >>>>> self._grids = self.grids.tolist()
> >>>>>
> >>>>> after the call to _populate_grid_objects() in grid_geometry_handler, I
> >>>>> was able to get all references tracked and removed.
> >>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Jun 10, 2014 at 8:47 PM, Matthew Turk <matthewturk at gmail.com>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Hi Nathan,
> >>>>>>>>
> >>>>>>>> I believe there are two things at work here.
> >>>>>>>>
> >>>>>>>> 1) (I do not have high confidence in this one.)  YTArrays that are
> >>>>>>>> referenced with a .d and turned into numpy arrays which no longer *own
> >>>>>>>> the data* may be retaining a reference, but that reference doesn't get
> >>>>>>>> freed later.  This happens often when we are doing things in the
> >>>>>>>> hierarchy instantiation phase.  I haven't been able to figure out
> >>>>>>>> which references get lost; for me, over 40 outputs, I lost 1560.  I
> >>>>>>>> think it's 39 YTArrays per hierarchy.  This might also be related to
> >>>>>>>> field detection.  I think this is not a substantial contributor.
> >>>>>>>> 2) For some reason, when the .grids attribute (object array) is
> >>>>>>>> deleted on an index, the refcounts of those grids don't decrease.  I
> >>>>>>>> am able to decrease their refcounts by manually setting
> >>>>>>>> pf.index.grids[:] = None.  This eliminated all retained grid
> >>>>>>>> references.
> >>>>>>>>
> >>>>>>>> So, I think the root is that at some point, because of circular
> >>>>>>>> references or whatever, the finalizer isn't being called on the
> >>>>>>>> GridIndex (or on Index itself).  This results in the reference to the
> >>>>>>>> grids array being kept, which then pumps up the lost object count.  I
> >>>>>>>> don't know why it's not getting called (it's not guaranteed to be
> >>>>>>>> called, in any event).
> >>>>>>>>
> >>>>>>>> I have to take care of some other things (including Brendan's note
> >>>>>>>> about the memory problems with particle datasets) but I am pretty
> >>>>>>>> sure this is the root.
> >>>>>>>>
> >>>>>>>> -Matt
> >>>>>>>>
> >>>>>>>> On Tue, Jun 10, 2014 at 10:13 PM, Matthew Turk
> >>>>>>>> <matthewturk at gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>> Hi Nathan,
> >>>>>>>>>
> >>>>>>>>> All it requires is a call to .index; you don't need to do anything
> >>>>>>>>> else to get it to lose references.
> >>>>>>>>>
> >>>>>>>>> I'm still looking into it.
> >>>>>>>>>
> >>>>>>>>> -Matt
> >>>>>>>>>
> >>>>>>>>> On Tue, Jun 10, 2014 at 9:26 PM, Nathan Goldbaum
> >>>>>>>>> <nathan12343 at gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Jun 10, 2014 at 10:59 AM, Matthew Turk
> >>>>>>>>>> <matthewturk at gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Do you have a reproducible script?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> This should do the trick: http://paste.yt-project.org/show/4767/
> >>>>>>>>>>
> >>>>>>>>>> (this is with an enzo dataset by the way)
> >>>>>>>>>>
> >>>>>>>>>> That script prints (on my machine):
> >>>>>>>>>>
> >>>>>>>>>> EnzoGrid    15065
> >>>>>>>>>> YTArray     1520
> >>>>>>>>>> list        704
> >>>>>>>>>> dict        2
> >>>>>>>>>> MaskedArray 1
> >>>>>>>>>>
> >>>>>>>>>> Which indicates that about 15000 EnzoGrid objects and 1520 YTArray
> >>>>>>>>>> objects have leaked.
> >>>>>>>>>>
> >>>>>>>>>> The list I'm printing out at the end of the script should be the objects
> >>>>>>>>>> that leaked during the loop over the Enzo dataset.  The
> >>>>>>>>>> objgraph.get_leaking_objects() function returns the list of all objects
> >>>>>>>>>> being tracked by the garbage collector that have no references but still
> >>>>>>>>>> have nonzero refcounts.
> >>>>>>>>>>
> >>>>>>>>>> This means the "original_leaks" list isn't necessarily a list of leaky
> >>>>>>>>>> objects - most of the things in there are singletons that the interpreter
> >>>>>>>>>> keeps around. To create a list of leaky objects produced by iterating
> >>>>>>>>>> over the loop I take the set difference of the output of
> >>>>>>>>>> get_leaking_objects() before and after iterating over the dataset.
> >>>>>>>>>>
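
For anyone without objgraph installed, here is a cruder, standard-library-only
variant of the same before/after idea. It is only a sketch and not the pasted
script: it does not do what get_leaking_objects() does, it simply counts
gc-tracked objects by type name before and after a loop and reports what grew.
The Grid class and the retained list are hypothetical stand-ins, not yt
objects.

```python
import gc
from collections import Counter


def live_type_counts():
    """Count objects currently tracked by the garbage collector, by type name."""
    gc.collect()  # drop anything that is merely awaiting collection
    return Counter(type(obj).__name__ for obj in gc.get_objects())


class Grid:
    """Hypothetical stand-in for a grid object; not a yt class."""


retained = []  # simulates something that wrongly keeps grids alive

before = live_type_counts()
for _ in range(5):
    grids = [Grid() for _ in range(100)]
    retained.append(grids)
after = live_type_counts()

# Counter subtraction keeps only the types whose counts went up.
growth = after - before
for name, count in growth.most_common(5):
    print(f"{name:<12}{count}")
```

The type that dominates the growth table is usually the first place to look,
much like the EnzoGrid line in the output quoted above.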
> >>>>>>>>>>>
> >>>>>>>>>>> If you make a bunch of symlinks to
> >>>>>>>>>>> one flash file and load them all in sequence, does that replicate
> >>>>>>>>>>> the behavior?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Yes, it seems to.  Compare the output of this script:
> >>>>>>>>>> http://paste.yt-project.org/show/4768/
> >>>>>>>>>>
> >>>>>>>>>> Adjust the range of the for loop from 0 to 5 - creating the symlinks
> >>>>>>>>>> to WindTunnel/windtunnel_4lev_hdf5_plt_cnt_0040 as needed.
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Jun 10, 2014 at 12:57 PM, Nathan Goldbaum
> >>>>>>>>>>> <nathan12343 at gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Jun 10, 2014 at 10:45 AM, Matthew Turk
> >>>>>>>>>>>> <matthewturk at gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Nathan,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, Jun 10, 2014 at 12:43 PM, Nathan Goldbaum
> >>>>>>>>>>>>> <nathan12343 at gmail.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, Jun 10, 2014 at 6:09 AM, Matthew Turk
> >>>>>>>>>>>>>> <matthewturk at gmail.com>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Nathan,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Mon, Jun 9, 2014 at 11:02 PM, Nathan Goldbaum
> >>>>>>>>>>>>>>> <nathan12343 at gmail.com>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>> Hey all,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I'm looking at a memory leak that Philip (cc'd) is seeing when
> >>>>>>>>>>>>>>>> iterating over a long list of FLASH datasets.  Just as an example
> >>>>>>>>>>>>>>>> of the type of behavior he is seeing - today he left his script
> >>>>>>>>>>>>>>>> running and ended up consuming 300 GB of RAM on a viz node.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> FWIW, the dataset is not particularly large - ~300 outputs and
> >>>>>>>>>>>>>>>> ~100 MB per output. These are also FLASH cylindrical coordinate
> >>>>>>>>>>>>>>>> simulations - so perhaps this behavior will only occur in
> >>>>>>>>>>>>>>>> curvilinear geometries?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hm, I don't know about that.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I've been playing with objgraph to try to understand what's
> >>>>>>>>>>>>>>>> happening.  Here's the script I've been using:
> >>>>>>>>>>>>>>>> http://paste.yt-project.org/show/4762/
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Here's the output after one iteration of the for loop:
> >>>>>>>>>>>>>>>> http://paste.yt-project.org/show/4761/
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> It seems that for some reason a lot of data is not being garbage
> >>>>>>>>>>>>>>>> collected.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Could there be a reference counting bug somewhere down in a cython
> >>>>>>>>>>>>>>>> routine?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Based on what you're running, the only Cython routines being called
> >>>>>>>>>>>>>>> are likely in the selection system.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Objgraph is unable to find backreferences to root grid tiles in the
> >>>>>>>>>>>>>>>> flash dataset, and all the other yt objects that I've looked at seem
> >>>>>>>>>>>>>>>> to have backreference graphs that terminate at a FLASHGrid object
> >>>>>>>>>>>>>>>> that represents a root grid tile in one of the datasets.  That's the
> >>>>>>>>>>>>>>>> best guess I have - but definitely nothing conclusive.  I'd
> >>>>>>>>>>>>>>>> appreciate any other ideas anyone else has to help debug this.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I'm not entirely sure how to parse the output you've pasted, but I
> >>>>>>>>>>>>>>> do have a thought.  If you have a reproducible case, I can test it
> >>>>>>>>>>>>>>> myself.  I am wondering if this could be related to the way that
> >>>>>>>>>>>>>>> grid masks are cached.  You should be able to test this by adding
> >>>>>>>>>>>>>>> this line to _get_selector_mask in grid_patch.py, just before
> >>>>>>>>>>>>>>> "return mask"
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> self._last_mask = self._last_selector_id = None
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Something like this patch:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> http://paste.yt-project.org/show/4316/
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks for the code!  I will look into this today.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Sorry for not explaining the random terminal output I pasted from
> >>>>>>>>>>>>>> objgraph :/
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> It's a list of objects created after yt operates on one dataset and
> >>>>>>>>>>>>>> after the garbage collector is explicitly called. Each iteration of
> >>>>>>>>>>>>>> the loop sees the creation of objects representing the FLASH grids,
> >>>>>>>>>>>>>> hierarchy, and associated metadata.  With enough iterations this
> >>>>>>>>>>>>>> overhead from previous loop iterations begins to dominate the total
> >>>>>>>>>>>>>> memory budget.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The code snippet I sent might help reduce it, but I think it speaks
> >>>>>>>>>>>>> to a deeper problem in that somehow the FLASH stuff isn't being GC'd
> >>>>>>>>>>>>> anywhere.  It really ought to be.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Can you try also doing:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> yt.frontends.flash.FLASHDataset._skip_cache = True
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> No effect, unfortunately.
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> and seeing if that helps?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> -Matt
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks for your help in debugging this!
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> -Nathan
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> _______________________________________________
> >>>>>>>>>>>>>>> yt-dev mailing list
> >>>>>>>>>>>>>>> yt-dev at lists.spacepope.org
> >>>>>>>>>>>>>>> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org