[Yt-dev] RMS mass overdensity method

Stephen Skory stephenskory at yahoo.com
Wed Mar 4 08:11:47 PST 2009


Matt,

> http://paste.enzotools.org/show/67/
> 
> It should be useful in your case.

As you sent it to me, it doesn't work: I don't get any error messages at all. If I take away the first Barrier() I do get error messages and proc IDs (I added >> stderr to the prints because Ranger separates stdout and stderr), but they're still all jumbled together.
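To make concrete what I mean by that, the pattern I'm describing is roughly the following (just a sketch, not the contents of your paste; it calls mpi4py directly, which is what yt's parallel machinery sits on, and the message text is made up):

    # Sketch: each proc writes its progress to stderr, taking turns between
    # barriers so Ranger keeps it out of stdout and the lines don't interleave.
    import sys
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    def report(message):
        for turn in range(comm.Get_size()):
            comm.Barrier()                   # only one rank prints per turn
            if turn == rank:
                sys.stderr.write("proc %03d: %s\n" % (rank, message))
                sys.stderr.flush()
        comm.Barrier()

Even with something like that, the output from the 64 tasks is still interleaved once the crash actually happens.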

> What I would try, to separate out the memory issues, would be manually
> cleaning each grid after you do, say, 10 sphere:
> 
> for g in pf.h.grids: g.clear_data()

This does nothing, as far as I can tell, to help this problem.
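For reference, this is roughly how I wired it in (a sketch only; the dataset name, radius, and sample count below are placeholders, not my real values, and the RMS accumulation is elided):

    # Sketch of the sphere-sampling loop with the periodic grid clearing added.
    import numpy as na
    import yt.lagos as lagos

    pf = lagos.EnzoStaticOutput("DataDump0020")   # placeholder dataset
    radius = 0.02                                  # sphere radius in code units (placeholder)
    n_samples = 1000                               # placeholder

    masses = []
    for i in range(n_samples):
        center = na.random.random(3)               # random point in the unit box
        sp = pf.h.sphere(center, radius)
        masses.append(sp["CellMassMsun"].sum())    # total mass inside this sphere
        del sp
        if (i + 1) % 10 == 0:                      # every 10 spheres, drop cached grid data
            for g in pf.h.grids:
                g.clear_data()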

I started looking at where it was crashing, and the answer is that it crashes inconsistently: if I look at the work cycle per proc, it isn't crashing at the same place every time. I did find a spot it was at least consistently reaching and passing before crashing, so I modified my code to run in parallel up to that point and in 'serial' after that; see the paste below, starting at line 44. In terms of samples (which is what i loops over), in strict parallel it generally crashes between 60 and 80. By that I mean: if I run "grep 'sample NNNNNN' error.log | wc", somewhere between 60 and 80 is the highest sample number that shows up 64 times, once per task. Some tasks make it well past 80 before dying.

http://paste.enzotools.org/show/69/
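In case that paste goes away, the gist of the change is roughly this (a simplified sketch, not the exact code in the paste; do_one_sample() and the numbers are stand-ins for the per-sample sphere work and my real counts):

    # Sketch: run the sample loop in parallel up to a point, then serialize it
    # with barriers so only one rank does sphere work at a time.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    n_samples = 200          # placeholder
    switch_point = 60        # roughly where the strictly parallel runs were dying

    def do_one_sample(i):
        # stand-in for cutting a sphere and accumulating the overdensity
        pass

    for i in range(n_samples):
        if i < switch_point:
            do_one_sample(i)             # all ranks work simultaneously
        else:
            for turn in range(size):
                comm.Barrier()           # ranks take turns, one at a time
                if turn == rank:
                    do_one_sample(i)
            comm.Barrier()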

To cut the details short (sorry about the length above): in the parallel->serial mode it runs significantly longer; I got up to 127 samples before it crashed. However, when it crashed in 'serial' mode it gave me no error traceback at all, and I don't have the modified traceback turned on, so this may be a different crash entirely. I ssh-ed into one of the nodes during this kind of run, and as far as I could tell the memory usage wasn't growing once it entered the spheres phase.

I ran my code in parallel on my small test dataset with the sample count turned up high enough that each task did many more samples than I'm asking of each L7 task, and that didn't crash.

I also tried calling my code many times in a row on L7 with small sample sizes, and, somewhat encouragingly, it crashes at about the same point in total samples as a single high-sample call does.

I haven't tried doing this in strict serial yet on L7 because that would take a very long time. Just reading the data would take several hours.

I'm about to try running this code on a different L7 snapshot, to see if it's related to a problem in the particular snapshot I've been working with. I'm not optimistic.

Any other suggestions of what I could try?

Thanks!

 _______________________________________________________
sskory at physics.ucsd.edu           o__  Stephen Skory
http://physics.ucsd.edu/~sskory/ _.>/ _Graduate Student
________________________________(_)_\(_)_______________


