[yt-dev] Zombie jobs on eudora?

Matthew Turk matthewturk at gmail.com
Wed Jun 11 12:30:55 PDT 2014


Can you try each of these on its own, to narrow down where the memory goes:

 * Just a slice
 * Just a slice you've accessed 'ones' in
 * Slice => FRB
 * Just pf.index
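
One way to compare those cases is to run each one repeatedly and see which
object types gain instances.  The sketch below is only an illustration using
the standard gc module; the names growth_after, FakeGrid, leaky_step and
clean_step are made up for the example and are not part of yt or objgraph.
Swap the body of a step for the real operation (a slice, slice => FRB, or
just pf.index).

```python
import gc
from collections import Counter


def live_type_counts():
    """Count objects currently tracked by the garbage collector, by type name."""
    gc.collect()
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def growth_after(step, repeats=5):
    """Run `step` several times and report which types gained instances."""
    step()  # warm-up call so one-time caches are not counted as growth
    before = live_type_counts()
    for _ in range(repeats):
        step()
    after = live_type_counts()
    return {name: after[name] - before[name]
            for name in after if after[name] > before[name]}


# Hypothetical stand-ins for the cases listed above.
class FakeGrid:
    pass


_retained = []


def leaky_step():
    _retained.append(FakeGrid())  # deliberately keeps a reference alive


def clean_step():
    FakeGrid()  # dropped immediately, so nothing should accumulate


leaky_growth = growth_after(leaky_step)
clean_growth = growth_after(clean_step)
print(leaky_growth.get("FakeGrid", 0), clean_growth.get("FakeGrid", 0))
```

Memory held outside gc-tracked Python objects would not show up in these
counts, so a step can look clean here and still grow the process footprint.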

On Wed, Jun 11, 2014 at 2:26 PM, Nathan Goldbaum <nathan12343 at gmail.com> wrote:
> I can confirm that creating the list alias to .grids does eliminate the Grid
> objects from showing up in the objgraph output.
>
> That said, I'm still seeing steadily increasing memory usage when I iterate
> over a bunch of datasets (http://paste.yt-project.org/show/4773/), creating
> SlicePlots for each one.  I'm not sure yet where the memory is going, just
> that objgraph can't see it.
>
>
> On Wed, Jun 11, 2014 at 12:02 PM, Matthew Turk <matthewturk at gmail.com>
> wrote:
>>
>> On Wed, Jun 11, 2014 at 1:58 PM, Nathan Goldbaum <nathan12343 at gmail.com>
>> wrote:
>> > Could this issue be related?
>> >
>> > https://github.com/numpy/numpy/issues/1601
>>
>> Yeah, that's the one.
>>
>> >
>> > Can you elaborate a bit more about why we're using an object array in
>> > the first place?  If switching to using a list solves these issues
>> > perhaps that is the way to go.
>>
>> Two reasons.  One is that it's an order of magnitude faster for some
>> things that we do a lot, and the other is that it makes it much easier
>> to index.  We can do things like selection based on indices or booleans
>> this way.  But we don't do that very often anymore.
>>
>> I don't want to switch it to a list; that's a nasty bandaid that
>> breaks things.  We can just add on an additional list, which is a
>> nasty bandaid that doesn't break things.  I think the memory overhead
>> will be minimal for that.  Really fixing it will require moving away
>> from arrays completely, which we can slot in for 3.1.
>>
>> >
>> >
>> > On Wed, Jun 11, 2014 at 7:25 AM, Matthew Turk <matthewturk at gmail.com>
>> > wrote:
>> >>
>> >> I should also note, at some point in the future I want to get rid of
>> >> the object arrays for grids, but that timescale is longer.  Using
>> >> John's grid tree is a much better approach.
>> >>
>> >> On Wed, Jun 11, 2014 at 9:21 AM, Matthew Turk <matthewturk at gmail.com>
>> >> wrote:
>> >> > On Tue, Jun 10, 2014 at 10:53 PM, Matthew Turk
>> >> > <matthewturk at gmail.com>
>> >> > wrote:
>> >> >> On Tue, Jun 10, 2014 at 10:50 PM, Nathan Goldbaum
>> >> >> <nathan12343 at gmail.com> wrote:
>> >> >>> For the leaking YTArrays, Kacper suggested the following patch on
>> >> >>> IRC:
>> >> >>>
>> >> >>> http://bpaste.net/show/361120/
>> >> >>>
>> >> >>> This works for FLASH but seems to break field detection for enzo.
>> >> >>
>> >> >> I don't think this will ever be a big memory hog, but it is worth
>> >> >> fixing.
>> >> >>
>> >> >
>> >> > I've spent a small bit of time at this again this morning, and
>> >> > everything seems to come back down to the issue of having a numpy
>> >> > array of grid objects.  If I switch this to a list, the reference
>> >> > counting is correct again and things get deallocated properly.  I've
>> >> > tried a number of ways of changing how they're allocated, but none
>> >> > seem to work for getting the refcount correct.  Oddly enough, if I
>> >> > track both a list *and* an array (i.e., set self._grids =
>> >> > self.grids.tolist()) then the refcounting is correct.
>> >> >
>> >> > I'm sure there's an explanation for this, but I don't know it.  It
>> >> > looks to me like numpy thinks it owns the data and that it should
>> >> > decrement the object refcount.
>> >> >
>> >> > By adding this line:
>> >> >
>> >> > self._grids = self.grids.tolist()
>> >> >
>> >> > after the call to _populate_grid_objects() in grid_geometry_handler,
>> >> > I was able to get all references tracked and removed.
>> >> >
>> >> >>>
>> >> >>>
>> >> >>> On Tue, Jun 10, 2014 at 8:47 PM, Matthew Turk
>> >> >>> <matthewturk at gmail.com>
>> >> >>> wrote:
>> >> >>>>
>> >> >>>> Hi Nathan,
>> >> >>>>
>> >> >>>> I believe there are two things at work here.
>> >> >>>>
>> >> >>>> 1) (I do not have high confidence in this one.)  YTArrays that are
>> >> >>>> referenced with a .d and turned into numpy arrays which no longer
>> >> >>>> *own the data* may be retaining a reference, but that reference
>> >> >>>> doesn't get freed later.  This happens often when we are doing
>> >> >>>> things in the hierarchy instantiation phase.  I haven't been able
>> >> >>>> to figure out which references get lost; for me, over 40 outputs,
>> >> >>>> I lost 1560.  I think it's 39 YTArrays per hierarchy.  This might
>> >> >>>> also be related to field detection.  I think this is not a
>> >> >>>> substantial contributor.
>> >> >>>> 2) For some reason, when the .grids attribute (object array) is
>> >> >>>> deleted on an index, the refcounts of those grids don't decrease.
>> >> >>>> I am able to decrease their refcounts by manually setting
>> >> >>>> pf.index.grids[:] = None.  This eliminated all retained grid
>> >> >>>> references.
>> >> >>>>
>> >> >>>> So, I think the root is that at some point, because of circular
>> >> >>>> references or whatever, the finalizer isn't being called on the
>> >> >>>> GridIndex (or on Index itself).  This results in the reference to
>> >> >>>> the grids array being kept, which then pumps up the lost object
>> >> >>>> count.  I don't know why it's not getting called (it's not
>> >> >>>> guaranteed to be called, in any event).
>> >> >>>>
>> >> >>>> I have to take care of some other things (including Brendan's note
>> >> >>>> about the memory problems with particle datasets) but I am pretty
>> >> >>>> sure this is the root.
>> >> >>>>
>> >> >>>> -Matt
>> >> >>>>
>> >> >>>> On Tue, Jun 10, 2014 at 10:13 PM, Matthew Turk
>> >> >>>> <matthewturk at gmail.com>
>> >> >>>> wrote:
>> >> >>>> > Hi Nathan,
>> >> >>>> >
>> >> >>>> > All it requires is a call to .index; you don't need to do
>> >> >>>> > anything else to get it to lose references.
>> >> >>>> >
>> >> >>>> > I'm still looking into it.
>> >> >>>> >
>> >> >>>> > -Matt
>> >> >>>> >
>> >> >>>> > On Tue, Jun 10, 2014 at 9:26 PM, Nathan Goldbaum
>> >> >>>> > <nathan12343 at gmail.com>
>> >> >>>> > wrote:
>> >> >>>> >>
>> >> >>>> >>
>> >> >>>> >>
>> >> >>>> >> On Tue, Jun 10, 2014 at 10:59 AM, Matthew Turk
>> >> >>>> >> <matthewturk at gmail.com>
>> >> >>>> >> wrote:
>> >> >>>> >>>
>> >> >>>> >>> Do you have a reproducible script?
>> >> >>>> >>
>> >> >>>> >>
>> >> >>>> >> This should do the trick:
>> >> >>>> >> http://paste.yt-project.org/show/4767/
>> >> >>>> >>
>> >> >>>> >> (this is with an enzo dataset by the way)
>> >> >>>> >>
>> >> >>>> >> That script prints (on my machine):
>> >> >>>> >>
>> >> >>>> >> EnzoGrid    15065
>> >> >>>> >> YTArray     1520
>> >> >>>> >> list        704
>> >> >>>> >> dict        2
>> >> >>>> >> MaskedArray 1
>> >> >>>> >>
>> >> >>>> >> Which indicates that 15000 EnzoGrid objects and 1520 YTArray
>> >> >>>> >> objects have leaked.
>> >> >>>> >>
>> >> >>>> >> The list I'm printing out at the end of the script should be the
>> >> >>>> >> objects that leaked during the loop over the Enzo dataset.  The
>> >> >>>> >> objgraph.get_leaking_objects() function returns the list of all
>> >> >>>> >> objects being tracked by the garbage collector that have no
>> >> >>>> >> references but still have nonzero refcounts.
>> >> >>>> >>
>> >> >>>> >> This means the "original_leaks" list isn't necessarily a list of
>> >> >>>> >> leaky objects - most of the things in there are singletons that
>> >> >>>> >> the interpreter keeps around. To create a list of leaky objects
>> >> >>>> >> produced by iterating over the loop I take the set difference of
>> >> >>>> >> the output of get_leaking_objects() before and after iterating
>> >> >>>> >> over the dataset.
>> >> >>>> >>
>> >> >>>> >>>
>> >> >>>> >>> If you make a bunch of symlinks to one flash file and load them
>> >> >>>> >>> all in sequence, does that replicate the behavior?
>> >> >>>> >>
>> >> >>>> >>
>> >> >>>> >> Yes, it seems to.  Compare the output of this script:
>> >> >>>> >> http://paste.yt-project.org/show/4768/
>> >> >>>> >>
>> >> >>>> >> Adjust the range of the for loop from 0 to 5 - creating symlinks
>> >> >>>> >> to WindTunnel/windtunnel_4lev_hdf5_plt_cnt_0040 as needed.
>> >> >>>> >>
>> >> >>>> >>>
>> >> >>>> >>>
>> >> >>>> >>> On Tue, Jun 10, 2014 at 12:57 PM, Nathan Goldbaum
>> >> >>>> >>> <nathan12343 at gmail.com>
>> >> >>>> >>> wrote:
>> >> >>>> >>> >
>> >> >>>> >>> >
>> >> >>>> >>> >
>> >> >>>> >>> > On Tue, Jun 10, 2014 at 10:45 AM, Matthew Turk
>> >> >>>> >>> > <matthewturk at gmail.com>
>> >> >>>> >>> > wrote:
>> >> >>>> >>> >>
>> >> >>>> >>> >> Hi Nathan,
>> >> >>>> >>> >>
>> >> >>>> >>> >> On Tue, Jun 10, 2014 at 12:43 PM, Nathan Goldbaum
>> >> >>>> >>> >> <nathan12343 at gmail.com>
>> >> >>>> >>> >> wrote:
>> >> >>>> >>> >> >
>> >> >>>> >>> >> >
>> >> >>>> >>> >> >
>> >> >>>> >>> >> > On Tue, Jun 10, 2014 at 6:09 AM, Matthew Turk
>> >> >>>> >>> >> > <matthewturk at gmail.com>
>> >> >>>> >>> >> > wrote:
>> >> >>>> >>> >> >>
>> >> >>>> >>> >> >> Hi Nathan,
>> >> >>>> >>> >> >>
>> >> >>>> >>> >> >> On Mon, Jun 9, 2014 at 11:02 PM, Nathan Goldbaum
>> >> >>>> >>> >> >> <nathan12343 at gmail.com>
>> >> >>>> >>> >> >> wrote:
>> >> >>>> >>> >> >> > Hey all,
>> >> >>>> >>> >> >> >
>> >> >>>> >>> >> >> > I'm looking at a memory leak that Philip (cc'd) is seeing
>> >> >>>> >>> >> >> > when iterating over a long list of FLASH datasets.  Just
>> >> >>>> >>> >> >> > as an example of the type of behavior he is seeing - today
>> >> >>>> >>> >> >> > he left his script running and ended up consuming 300 GB
>> >> >>>> >>> >> >> > of RAM on a viz node.
>> >> >>>> >>> >> >> >
>> >> >>>> >>> >> >> > FWIW, the dataset is not particularly large - ~300 outputs
>> >> >>>> >>> >> >> > and ~100 MB per output. These are also FLASH cylindrical
>> >> >>>> >>> >> >> > coordinate simulations - so perhaps this behavior will
>> >> >>>> >>> >> >> > only occur in curvilinear geometries?
>> >> >>>> >>> >> >>
>> >> >>>> >>> >> >> Hm, I don't know about that.
>> >> >>>> >>> >> >>
>> >> >>>> >>> >> >> >
>> >> >>>> >>> >> >> > I've been playing with objgraph to try to understand what's
>> >> >>>> >>> >> >> > happening.  Here's the script I've been using:
>> >> >>>> >>> >> >> > http://paste.yt-project.org/show/4762/
>> >> >>>> >>> >> >> >
>> >> >>>> >>> >> >> > Here's the output after one iteration of the for loop:
>> >> >>>> >>> >> >> > http://paste.yt-project.org/show/4761/
>> >> >>>> >>> >> >> >
>> >> >>>> >>> >> >> > It seems that for some reason a lot of data is not being
>> >> >>>> >>> >> >> > garbage collected.
>> >> >>>> >>> >> >> >
>> >> >>>> >>> >> >> > Could there be a reference counting bug somewhere down in
>> >> >>>> >>> >> >> > a Cython routine?
>> >> >>>> >>> >> >>
>> >> >>>> >>> >> >> Based on what you're running, the only Cython routines being
>> >> >>>> >>> >> >> called are likely in the selection system.
>> >> >>>> >>> >> >>
>> >> >>>> >>> >> >> > Objgraph is unable to find backreferences to root grid
>> >> >>>> >>> >> >> > tiles in the FLASH dataset, and all the other yt objects
>> >> >>>> >>> >> >> > that I've looked at seem to have backreference graphs that
>> >> >>>> >>> >> >> > terminate at a FLASHGrid object that represents a root
>> >> >>>> >>> >> >> > grid tile in one of the datasets.  That's the best guess I
>> >> >>>> >>> >> >> > have - but definitely nothing conclusive.  I'd appreciate
>> >> >>>> >>> >> >> > any other ideas anyone else has to help debug this.
>> >> >>>> >>> >> >>
>> >> >>>> >>> >> >> I'm not entirely sure how to parse the output you've pasted,
>> >> >>>> >>> >> >> but I do have a thought.  If you have a reproducible case, I
>> >> >>>> >>> >> >> can test it myself.  I am wondering if this could be related
>> >> >>>> >>> >> >> to the way that grid masks are cached.  You should be able
>> >> >>>> >>> >> >> to test this by adding this line to _get_selector_mask in
>> >> >>>> >>> >> >> grid_patch.py, just before "return mask":
>> >> >>>> >>> >> >>
>> >> >>>> >>> >> >> self._last_mask = self._last_selector_id = None
>> >> >>>> >>> >> >>
>> >> >>>> >>> >> >> Something like this patch:
>> >> >>>> >>> >> >>
>> >> >>>> >>> >> >> http://paste.yt-project.org/show/4316/
>> >> >>>> >>> >> >
>> >> >>>> >>> >> >
>> >> >>>> >>> >> > Thanks for the code!  I will look into this today.
>> >> >>>> >>> >> >
>> >> >>>> >>> >> > Sorry for not explaining the random terminal output I pasted
>> >> >>>> >>> >> > from objgraph :/
>> >> >>>> >>> >> >
>> >> >>>> >>> >> > It's a list of objects created after yt operates on one
>> >> >>>> >>> >> > dataset and after the garbage collector is explicitly called.
>> >> >>>> >>> >> > Each iteration of the loop sees the creation of objects
>> >> >>>> >>> >> > representing the FLASH grids, hierarchy, and associated
>> >> >>>> >>> >> > metadata.  With enough iterations this overhead from previous
>> >> >>>> >>> >> > loop iterations begins to dominate the total memory budget.
>> >> >>>> >>> >>
>> >> >>>> >>> >> The code snippet I sent might help reduce it, but I think it
>> >> >>>> >>> >> speaks to a deeper problem in that somehow the FLASH stuff
>> >> >>>> >>> >> isn't being GC'd anywhere.  It really ought to be.
>> >> >>>> >>> >>
>> >> >>>> >>> >> Can you try also doing:
>> >> >>>> >>> >>
>> >> >>>> >>> >> yt.frontends.flash.FLASHDataset._skip_cache = True
>> >> >>>> >>> >
>> >> >>>> >>> >
>> >> >>>> >>> > No effect, unfortunately.
>> >> >>>> >>> >
>> >> >>>> >>> >>
>> >> >>>> >>> >> and seeing if that helps?
>> >> >>>> >>> >>
>> >> >>>> >>> >> >
>> >> >>>> >>> >> >>
>> >> >>>> >>> >> >>
>> >> >>>> >>> >> >>
>> >> >>>> >>> >> >> -Matt
>> >> >>>> >>> >> >>
>> >> >>>> >>> >> >> >
>> >> >>>> >>> >> >> > Thanks for your help in debugging this!
>> >> >>>> >>> >> >> >
>> >> >>>> >>> >> >> > -Nathan
>> >> >>>> >>> >> >> >
>> >> >>>> >>> >> >> _______________________________________________
>> >> >>>> >>> >> >> yt-dev mailing list
>> >> >>>> >>> >> >> yt-dev at lists.spacepope.org
>> >> >>>> >>> >> >>
>> >> >>>> >>> >> >>
>> >> >>>> >>> >> >> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org


