[yt-dev] Zombie jobs on eudora?

Kacper Kowalik xarthisius.kk at gmail.com
Wed Jun 11 12:42:10 PDT 2014


On 11.06.2014 21:26, Nathan Goldbaum wrote:
> I can confirm that creating the list alias to .grids does eliminate the
> Grid objects from showing up in the objgraph output.
> 
> That said, I'm still seeing steadily increasing memory usage when I iterate
> over a bunch of datasets (http://paste.yt-project.org/show/4773/), creating
> SlicePlots for each one.  I'm not sure yet where the memory is going, just
> that objgraph can't see it.

Hi Nathan,
if you're iterating over big files, make sure the growth you're seeing
isn't just the kernel's buffer/cache.
Cheers,
Kacper
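Kacper's point can be checked with a stdlib-only sketch: the kernel's buffer/cache does not count toward a process's RSS, so logging the interpreter's peak RSS per iteration separates a real leak from cache growth. The dataset-loading step below is a hypothetical stand-in, not yt code:

```python
import resource

def peak_rss_mb():
    """Peak resident set size of this process, in MB.

    ru_maxrss is reported in kilobytes on Linux (bytes on macOS).
    """
    kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return kb / 1024.0

# Hypothetical stand-in for the loop over datasets: log peak RSS each
# iteration.  If RSS stays flat while `free` shows memory disappearing,
# the "growth" is just the page cache, not the interpreter.
for i in range(3):
    data = [0.0] * 1_000_000  # pretend this is loading a dataset
    print(f"iteration {i}: peak RSS ~ {peak_rss_mb():.1f} MB")
    del data
```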

> 
> 
> On Wed, Jun 11, 2014 at 12:02 PM, Matthew Turk <matthewturk at gmail.com>
> wrote:
> 
>> On Wed, Jun 11, 2014 at 1:58 PM, Nathan Goldbaum <nathan12343 at gmail.com>
>> wrote:
>>> Could this issue be related?
>>>
>>> https://github.com/numpy/numpy/issues/1601
>>
>> Yeah, that's the one.
>>
>>>
>>> Can you elaborate a bit more about why we're using an object array in the
>>> first place?  If switching to using a list solves these issues perhaps
>> that
>>> is the way to go.
>>
>> Two reasons.  One is that it's an order of magnitude faster for some
>> things that we do a lot, and the other is that it makes it much easier
>> to index.  We can do things like selection based on indices or booleans
>> this way.  But we don't do that very often anymore.
>>
>> I don't want to switch it to a list; that's a nasty bandaid that
>> breaks things.  We can just add on an additional list, which is a
>> nasty bandaid that doesn't break things.  I think the memory overhead
>> will be minimal for that.  Really fixing it will require moving away
>> from arrays completely, which we can slot in for 3.1.
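For illustration, a small sketch of the kind of boolean/fancy indexing Matt describes. `FakeGrid` and its `Level` attribute are invented stand-ins for this example, not yt's real grid class:

```python
import numpy as np

class FakeGrid:
    """Illustrative stand-in for a yt grid object (not yt's real class)."""
    def __init__(self, level):
        self.Level = level

# An object array, like the .grids attribute under discussion.
grids = np.empty(4, dtype=object)
for i in range(4):
    grids[i] = FakeGrid(level=i % 2)

levels = np.array([g.Level for g in grids])

# Boolean selection works directly on the object array; a plain Python
# list would need a comprehension like [g for g in grids if g.Level == 1].
finest = grids[levels == 1]
print(len(finest))  # 2
```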
>>
>>>
>>>
>>> On Wed, Jun 11, 2014 at 7:25 AM, Matthew Turk <matthewturk at gmail.com>
>> wrote:
>>>>
>>>> I should also note, at some point in the future I want to get rid of
>>>> the object arrays for grids, but that timescale is longer.  Using
>>>> John's grid tree is a much better approach.
>>>>
>>>> On Wed, Jun 11, 2014 at 9:21 AM, Matthew Turk <matthewturk at gmail.com>
>>>> wrote:
>>>>> On Tue, Jun 10, 2014 at 10:53 PM, Matthew Turk
>>>>> <matthewturk at gmail.com> wrote:
>>>>>> On Tue, Jun 10, 2014 at 10:50 PM, Nathan Goldbaum
>>>>>> <nathan12343 at gmail.com> wrote:
>>>>>>> For the leaking YTArrays, Kacper suggested the following patch on
>> IRC:
>>>>>>>
>>>>>>> http://bpaste.net/show/361120/
>>>>>>>
>>>>>>> This works for FLASH but seems to break field detection for enzo.
>>>>>>
>>>>>> I don't think this will ever be a big memory hog, but it is worth
>>>>>> fixing.
>>>>>>
>>>>>
>>>>> I've spent a small bit of time at this again this morning, and
>>>>> everything seems to come back down to the issue of having a numpy
>>>>> array of grid objects.  If I switch this to a list, the reference
>>>>> counting is correct again and things get deallocated properly.  I've
>>>>> tried a number of ways of changing how they're allocated, but none
>>>>> seem to work for getting the refcount correct.  Oddly enough, if I
>>>>> track both a list *and* an array (i.e., set self._grids =
>>>>> self.grids.tolist()) then the refcounting is correct.
>>>>>
>>>>> I'm sure there's an explanation for this, but I don't know it.  It
>>>>> looks to me like numpy thinks it owns the data and that it should
>>>>> decrement the object refcount.
>>>>>
>>>>> By adding this line:
>>>>>
>>>>> self._grids = self.grids.tolist()
>>>>>
>>>>> after the call to _populate_grid_objects() in grid_geometry_handler, I
>>>>> was able to get all references tracked and removed.
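The refcounting behavior under discussion can be probed with `sys.getrefcount`. `Sentinel` is an illustrative class; on a numpy affected by the numpy/numpy#1601 class of bugs, the final count would stay elevated after the `del`:

```python
import sys
import numpy as np

class Sentinel:
    """Illustrative placeholder for a grid object."""
    pass

obj = Sentinel()
baseline = sys.getrefcount(obj)

# Store the object in an object array, as the index does with grids.
arr = np.empty(1, dtype=object)
arr[0] = obj
held = sys.getrefcount(obj)   # the array holds one extra reference

del arr
released = sys.getrefcount(obj)

# With a correctly behaving numpy, deleting the object array returns the
# refcount to its baseline; the bug discussed here left it elevated.
print(baseline, held, released)
```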
>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jun 10, 2014 at 8:47 PM, Matthew Turk
>>>>>>> <matthewturk at gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi Nathan,
>>>>>>>>
>>>>>>>> I believe there are two things at work here.
>>>>>>>>
>>>>>>>> 1) (I do not have high confidence of this one.)  YTArrays that
>>>>>>>> are referenced with a .d and turned into numpy arrays which no
>>>>>>>> longer *own the data* may be retaining a reference, but that
>>>>>>>> reference doesn't get freed later.  This happens often when we
>>>>>>>> are doing things in the hierarchy instantiation phase.  I haven't
>>>>>>>> been able to figure out which references get lost; for me, over
>>>>>>>> 40 outputs, I lost 1560.  I think it's 39 YTArrays per hierarchy.
>>>>>>>> This might also be related to field detection.  I think this is
>>>>>>>> not a substantial contributor.
>>>>>>>> 2) For some reason, when the .grids attribute (object array) is
>>>>>>>> deleted on an index, the refcounts of those grids don't decrease.
>>>>>>>> I am able to decrease their refcounts by manually setting
>>>>>>>> pf.index.grids[:] = None.  This eliminated all retained grid
>>>>>>>> references.
>>>>>>>>
>>>>>>>> So, I think the root is that at some point, because of circular
>>>>>>>> references or whatever, the finalizer isn't being called on the
>>>>>>>> GridIndex (or on Index itself).  This results in the reference to
>>>>>>>> the grids array being kept, which then pumps up the lost object
>>>>>>>> count.  I don't know why it's not getting called (it's not
>>>>>>>> guaranteed to be called, in any event).
>>>>>>>>
>>>>>>>> I have to take care of some other things (including Brendan's note
>>>>>>>> about the memory problems with particle datasets) but I am pretty
>>>>>>>> sure
>>>>>>>> this is the root.
>>>>>>>>
>>>>>>>> -Matt
>>>>>>>>
>>>>>>>> On Tue, Jun 10, 2014 at 10:13 PM, Matthew Turk
>>>>>>>> <matthewturk at gmail.com>
>>>>>>>> wrote:
>>>>>>>>> Hi Nathan,
>>>>>>>>>
>>>>>>>>> All it requires is a call to .index; you don't need to do
>>>>>>>>> anything else to get it to lose references.
>>>>>>>>>
>>>>>>>>> I'm still looking into it.
>>>>>>>>>
>>>>>>>>> -Matt
>>>>>>>>>
>>>>>>>>> On Tue, Jun 10, 2014 at 9:26 PM, Nathan Goldbaum
>>>>>>>>> <nathan12343 at gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Jun 10, 2014 at 10:59 AM, Matthew Turk
>>>>>>>>>> <matthewturk at gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Do you have a reproducible script?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This should do the trick: http://paste.yt-project.org/show/4767/
>>>>>>>>>>
>>>>>>>>>> (this is with an enzo dataset by the way)
>>>>>>>>>>
>>>>>>>>>> That script prints (on my machine):
>>>>>>>>>>
>>>>>>>>>> EnzoGrid    15065
>>>>>>>>>> YTArray     1520
>>>>>>>>>> list        704
>>>>>>>>>> dict        2
>>>>>>>>>> MaskedArray 1
>>>>>>>>>>
>>>>>>>>>> Which indicates that 15000 EnzoGrid objects and 1520 YTArray
>>>>>>>>>> objects
>>>>>>>>>> have
>>>>>>>>>> leaked.
>>>>>>>>>>
>>>>>>>>>> The list I'm printing out at the end of the script should be
>>>>>>>>>> the objects that leaked during the loop over the Enzo dataset.
>>>>>>>>>> The objgraph.get_leaking_objects() function returns the list of
>>>>>>>>>> all objects being tracked by the garbage collector that have no
>>>>>>>>>> references but still have nonzero refcounts.
>>>>>>>>>>
>>>>>>>>>> This means the "original_leaks" list isn't necessarily a list
>>>>>>>>>> of leaky objects - most of the things in there are singletons
>>>>>>>>>> that the interpreter keeps around.  To create a list of leaky
>>>>>>>>>> objects produced by iterating over the loop, I take the set
>>>>>>>>>> difference of the output of get_leaking_objects() before and
>>>>>>>>>> after iterating over the dataset.
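A stdlib-only sketch of the before/after snapshot diff Nathan describes (the `survivor` list is a hypothetical stand-in for a retained grid; objgraph layers more filtering on top of this kind of comparison):

```python
import gc

def live_ids():
    """ids of all gc-tracked objects, after a forced collection."""
    gc.collect()
    return {id(o) for o in gc.get_objects()}

before = live_ids()

# Stand-in for one loop iteration that retains something it shouldn't.
survivor = ["pretend this is a grid retained across the iteration"]

after = live_ids()

# Objects whose ids appear only in the second snapshot outlived the
# "iteration" (modulo id reuse) -- candidates for leaked objects.
new_ids = after - before

print(id(survivor) in after)  # True: survivor is alive and gc-tracked
```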
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> If you make a bunch of symlinks to one flash file and load
>>>>>>>>>>> them all in sequence, does that replicate the behavior?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yes, it seems to.  Compare the output of this script:
>>>>>>>>>> http://paste.yt-project.org/show/4768/
>>>>>>>>>>
>>>>>>>>>> Adjust the range of the for loop from 0 to 5, creating
>>>>>>>>>> symlinks to WindTunnel/windtunnel_4lev_hdf5_plt_cnt_0040 as
>>>>>>>>>> needed.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jun 10, 2014 at 12:57 PM, Nathan Goldbaum
>>>>>>>>>>> <nathan12343 at gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jun 10, 2014 at 10:45 AM, Matthew Turk
>>>>>>>>>>>> <matthewturk at gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Nathan,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jun 10, 2014 at 12:43 PM, Nathan Goldbaum
>>>>>>>>>>>>> <nathan12343 at gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jun 10, 2014 at 6:09 AM, Matthew Turk
>>>>>>>>>>>>>> <matthewturk at gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Nathan,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Jun 9, 2014 at 11:02 PM, Nathan Goldbaum
>>>>>>>>>>>>>>> <nathan12343 at gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> Hey all,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm looking at a memory leak that Philip (cc'd) is
>>>>>>>>>>>>>>>> seeing when iterating over a long list of FLASH datasets.
>>>>>>>>>>>>>>>> Just as an example of the type of behavior he is seeing -
>>>>>>>>>>>>>>>> today he left his script running and ended up consuming
>>>>>>>>>>>>>>>> 300 GB of RAM on a viz node.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> FWIW, the dataset is not particularly large - ~300
>>>>>>>>>>>>>>>> outputs and ~100 MB per output.  These are also FLASH
>>>>>>>>>>>>>>>> cylindrical coordinate simulations - so perhaps this
>>>>>>>>>>>>>>>> behavior will only occur in curvilinear geometries?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hm, I don't know about that.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I've been playing with objgraph to try to understand
>>>>>>>>>>>>>>>> what's
>>>>>>>>>>>>>>>> happening.
>>>>>>>>>>>>>>>> Here's the script I've been using:
>>>>>>>>>>>>>>>> http://paste.yt-project.org/show/4762/
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Here's the output after one iteration of the for loop:
>>>>>>>>>>>>>>>> http://paste.yt-project.org/show/4761/
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It seems that for some reason a lot of data is not
>>>>>>>>>>>>>>>> being garbage collected.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Could there be a reference counting bug somewhere down
>>>>>>>>>>>>>>>> in a cython routine?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Based on what you're running, the only Cython routines
>>>>>>>>>>>>>>> being
>>>>>>>>>>>>>>> called
>>>>>>>>>>>>>>> are likely in the selection system.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Objgraph is unable to find backreferences to root grid
>>>>>>>>>>>>>>>> tiles in the flash dataset, and all the other yt objects
>>>>>>>>>>>>>>>> that I've looked at seem to have backreference graphs
>>>>>>>>>>>>>>>> that terminate at a FLASHGrid object that represents a
>>>>>>>>>>>>>>>> root grid tile in one of the datasets.  That's the best
>>>>>>>>>>>>>>>> guess I have - but definitely nothing conclusive.  I'd
>>>>>>>>>>>>>>>> appreciate any other ideas anyone else has to help debug
>>>>>>>>>>>>>>>> this.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm not entirely sure how to parse the output you've
>>>>>>>>>>>>>>> pasted, but I do have a thought.  If you have a
>>>>>>>>>>>>>>> reproducible case, I can test it myself.  I am wondering
>>>>>>>>>>>>>>> if this could be related to the way that grid masks are
>>>>>>>>>>>>>>> cached.  You should be able to test this by adding this
>>>>>>>>>>>>>>> line to _get_selector_mask in grid_patch.py, just before
>>>>>>>>>>>>>>> "return mask"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> self._last_mask = self._last_selector_id = None
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Something like this patch:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> http://paste.yt-project.org/show/4316/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the code!  I will look into this today.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sorry for not explaining the random terminal output I
>>>>>>>>>>>>>> pasted from objgraph :/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It's a list of objects created after yt operates on one
>>>>>>>>>>>>>> dataset and after the garbage collector is explicitly
>>>>>>>>>>>>>> called.  Each iteration of the loop sees the creation of
>>>>>>>>>>>>>> objects representing the FLASH grids, hierarchy, and
>>>>>>>>>>>>>> associated metadata.  With enough iterations this overhead
>>>>>>>>>>>>>> from previous loop iterations begins to dominate the total
>>>>>>>>>>>>>> memory budget.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The code snippet I sent might help reduce it, but I think
>>>>>>>>>>>>> it speaks to a deeper problem in that somehow the FLASH
>>>>>>>>>>>>> stuff isn't being GC'd anywhere.  It really ought to be.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can you try also doing:
>>>>>>>>>>>>>
>>>>>>>>>>>>> yt.frontends.flash.FLASHDataset._skip_cache = True
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> No effect, unfortunately.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> and seeing if that helps?
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -Matt
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for your help in debugging this!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -Nathan
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> yt-dev mailing list
>>>>>>>>>>>>>>> yt-dev at lists.spacepope.org
>>>>>>>>>>>>>>> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>
>>>
>>>
> 
> 
> 
> 

