[Yt-dev] quick question on particle IO

Matthew Turk matthewturk at gmail.com
Wed Oct 19 06:16:17 PDT 2011


Hi Stephen,

I spent a bit of time looking into pHOP's memory usage, and I think
there are some obvious places it could be improved.  Much of this is
due to pHOP's use of the halo objects and structures I wrote a long
time ago; I wonder if, in a version 2.0 of pHOP, these could be
jettisoned (they were never designed to be anything more than
"something that worked," and they manifestly no longer do) and
replaced with a new system that reflects our current understanding.

The places where I see memory jump that I think can be avoided:

 * Copying fields from self._data_source in __obtain_particles without
removing them afterward
 * Initializing ParallelHOPHaloFinder with copies of the position and
mass fields (created implicitly by the division step)
 * Copying position and mass into the fKD object without removing them
from the halo finder (is there any reason they can't simply be moved
in, removed from the halo finder, and then copied back out?)
 * rearrange = True is the default, which I believe makes a full
internal copy of the particle data inside fKD?

It also looks like some of this is because the fKD tree requires the
position array to be shaped (N, 3) for memory access speed.
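
For that specifically, here is a minimal sketch of what I mean by moving
rather than copying (field names are the standard particle_position_*
keys; whether fKD can use the resulting array without making its own
internal copy is a separate question, tied to the rearrange flag):

    import numpy as np

    def pack_positions(particle_fields):
        # Move x/y/z out of a dict of 1D arrays into a single (N, 3) array.
        # pop() removes the dict's reference to each 1D field right after
        # its column is copied, so the three originals are freed
        # incrementally instead of coexisting with a fully populated copy.
        n = particle_fields["particle_position_x"].size
        pos = np.empty((n, 3), dtype="float64")
        for i, ax in enumerate("xyz"):
            pos[:, i] = particle_fields.pop("particle_position_%s" % ax)
        return pos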

I've been running this on a test 256^3 dataset (recall 256^3 particles
* 8 bytes * 5 fields [posx, posy, posz, mass, index] = 0.625 gigs),
inserting calls to get_memory_usage() at various stages.  Initially the
peak memory usage of pHOP was about 4.5 gigs.  By inserting deletions
of the _data_source fields and removing the division step, I was able
to reduce the peak memory usage *before* construction of the kD-tree
to 1.4 gigs.  Afterward, it went up to 2.8 gigs.  With rearrange =
False instead, the post-tree peak only went up to 2.5 gigs.
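
For reference, the instrumentation is nothing fancy: just bracketing
each stage with get_memory_usage() and logging the result (a sketch;
I'm assuming the yt.funcs import path and that the value returned is
the resident memory of the process in megabytes):

    from yt.funcs import get_memory_usage, mylog

    def log_memory(stage):
        # Tag each measurement with the stage name so the jumps are easy
        # to attribute when reading the log afterward.
        mylog.info("Memory at %s: %0.1f MB", stage, get_memory_usage())

    log_memory("before __obtain_particles")
    # ... copy/delete _data_source fields ...
    log_memory("before kD-tree construction")
    # ... build the fKD tree ...
    log_memory("after kD-tree construction")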

I wasn't able to reduce memory usage by deleting (even as an
intermediate step) self.xpos, self.ypos, self.zpos, which then led me
to change how ParallelHOPHaloFinder is initialized: instead, I pass in
the actual particle_fields dictionary, and during the __init__ function
of ParallelHOPHaloFinder I pop these fields out when setting self.xpos
and so on.  This reduced the peak memory down to ~1.5 gigs before the
density calculation, at which point it hit 1.8 gigs.
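
Concretely, the change to __init__ looks roughly like this (a sketch
only: I've elided the other constructor arguments, and the mass key is
whatever particle_fields actually uses):

    class ParallelHOPHaloFinder(object):
        # Sketch of the memory-relevant part of __init__ only; the real
        # signature carries several more arguments (period, padding,
        # number of neighbors, and so on).
        def __init__(self, particle_fields, rearrange=True):
            # pop() hands each array over and removes it from the caller's
            # dictionary in one step, so the finder ends up holding the
            # only remaining reference rather than a duplicate.
            self.xpos = particle_fields.pop("particle_position_x")
            self.ypos = particle_fields.pop("particle_position_y")
            self.zpos = particle_fields.pop("particle_position_z")
            self.mass = particle_fields.pop("ParticleMassMsun")
            self.rearrange = rearrange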

Unfortunately, it's not entirely clear to me how we then copy back out
of the Forthon kD-tree at the right time; I don't quite know the inner
workings of pHOP.  But I think the approach is valid, as I don't
believe information is lost at any point.

I think a combination of reducing memory copies and reducing reliance
on my old, unnecessary object classes might be able to dramatically
improve pHOP.  Can we take a look together?  My (currently broken!)
patch is here, if it helps provide a starting point:

http://paste.yt-project.org/show/1875/

Thanks for any ideas,

Matt

On Wed, Oct 19, 2011 at 12:03 AM, Matthew Turk <matthewturk at gmail.com> wrote:
> Hi Geoffrey,
>
> Thank you *very* much for your detailed response!
>
> All of this sounds like memory errors.  I don't think it's a problem
> with Nautilus (although I personally experienced problems with the old
> GPFS filesystem on Nautilus, long ago.)
>
> I have a few followup questions for Stephen:
>
>  * Does parallel HOP still dynamically load balance?  To do so, does
> it conduct histograms across datasets (i.e., similar to how we
> subselect the particles for a region by striding over them) or does it
> load, evaluate, discard?
>  * What multiple of the total dataset memory size is necessary to
> p-HOP an ideally load balanced set of particles?
>  * Are there any points in the code where the root processor is used
> as a primary staging location, or where the arrays are duplicated in
> some large amount on the root processor?
>  * Are there any points where fields are duplicated?  What about
> fancy-indexing, or implicit copies?
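>
> To be concrete about that last point: basic slicing in numpy gives back
> a view, but boolean or integer ("fancy") indexing always allocates a new
> array, so any selection done that way silently duplicates the selected
> field.  A toy illustration:
>
>     import numpy as np
>     pos = np.random.random(10**7)
>     view = pos[1000:2000]             # basic slice: a view, no new data
>     mask = pos > 0.5
>     subset = pos[mask]                # boolean indexing: a full copy
>     picks = pos[np.array([1, 5, 9])]  # integer indexing: also a copy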
>
> Do you think it is reasonable, on a large system, to halo find a
> dataset of this size?  Is it feasible to construct resource estimates
> for ideally-balanced datasets?
>
> Thanks for any ideas,
>
> Matt
>
> On Tue, Oct 18, 2011 at 11:50 PM, Geoffrey So <gsiisg at gmail.com> wrote:
>> Sorry for the fragmented pieces of info; I was trying to determine what the
>> problem is with one of the sys admins at Nautilus, so I'm not even sure yet
>> if it is YT's problem.
>> Symptoms:
>> parallelHF fails for the 3200 cube dataset, but not always at the same
>> place, which leads us to think this might be a memory issue.
>> 1) What are you attempting to do, precisely?
>> Currently I'm trying to run parallelHF on subvolumes, since I've found
>> that the memory requirement of the whole dataset exceeds the machine's
>> available memory (Nautilus, with 4TB of shared memory).
>> 2) What type of data, and what size of data, are you applying this to?
>> I'm doing parallelHF with DM only, on a subvolume that's 1/64th of the
>> original volume.
>> 3) What is the version of yt you are using (changeset hash)?
>> I was using the latest YT as of last week when I ran the unsuccessful
>> runs; currently I'm trying Stephen's modification, which should help with
>> memory:
>> (dev-yt)Geoffreys-MacBook-Air:yt-hg gso$ hg identify
>> 2efcec06484e (yt) tip
>> I am going to modify my script and send it to the sys admin to run a test
>> on the 800 cube first.
>> I've been asked not to submit jobs on the 3200 because the last time I
>> did, it brought half the machine to a standstill.
>> 4) How are you launching yt?
>> I was launching it with 512 cores and 2TB of total memory, but they said
>> to try decreasing the MPI task count, so I've also tried 256, 64, and 32;
>> they all failed after a while.  A couple were doing fine during the
>> parallelHF phase but suddenly ended with:
>> MPI: MPI_COMM_WORLD rank 6 has terminated without calling MPI_Finalize()
>> MPI: aborting job
>> MPI: Received signal 9
>> 5) What is the memory available to each individual process?
>> I've usually launched the 3200 with 2TB of memory, with MPI task counts
>> varying from 32 to 512.
>> 6) Under what circumstances does yt crash?
>> I've also had
>> P100 yt : [INFO     ] 2011-10-03 08:03:06,125 Getting field
>> particle_position_x from 112
>> MPI: MPI_COMM_WORLD rank 153 has terminated without calling MPI_Finalize()
>> MPI: aborting job
>> MPI: Received signal 9
>>
>>
>> asallocash failed: system error trying to write a message header - Broken
>> pipe
>> and with the same script
>> P180 yt : [INFO     ] 2011-10-03 15:12:01,898 Finished with binary hierarchy
>> reading
>> Traceback (most recent call last):
>>   File "regionPHOP.py", line 23, in <module>
>>     sv = pf.h.region([i * delta[0] + delta[0] / 2.0,
>>   File
>> "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/data_objects/static_output.py",
>> line 169, in hierarchy
>>     self, data_style=self.data_style)
>>   File
>> "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/frontends/enzo/data_structures.py",
>> line 162, in __init__
>>     AMRHierarchy.__init__(self, pf, data_style)
>>   File
>> "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/data_objects/hierarchy.py",
>> line 79, in __init__
>>     self._detect_fields()
>>   File
>> "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/frontends/enzo/data_structures.py",
>> line 405, in _detect_fields
>>     self.save_data(list(field_list),"/","DataFields",passthrough=True)
>>   File
>> "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/utilities/parallel_tools/parallel_analysis_interface.py",
>> line 216, in in_order
>>     f1(*args, **kwargs)
>>   File
>> "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/data_objects/hierarchy.py",
>> line 222, in _save_data
>>     arr = myGroup.create_dataset(name,data=array)
>>   File
>> "/nics/b/home/gsiisg/NautilusYT/lib/python2.7/site-packages/h5py-1.3.1-py2.7-linux-x86_64.egg/h5py/highlevel.py",
>> line 464, in create_dataset
>>     return Dataset(self, name, *args, **kwds)
>>   File
>> "/nics/b/home/gsiisg/NautilusYT/lib/python2.7/site-packages/h5py-1.3.1-py2.7-linux-x86_64.egg/h5py/highlevel.py",
>> line 1092, in __init__
>>     space_id = h5s.create_simple(shape, maxshape)
>>   File "h5s.pyx", line 103, in h5py.h5s.create_simple (h5py/h5s.c:952)
>> h5py._stub.ValueError: Zero sized dimension for non-unlimited dimension
>> (Invalid arguments to routine: Bad value)
>>
>> 7) How does yt report this crash to you, and is it deterministic?
>>
>> Many times there isn't any associated error output in the logs; the
>> process just hangs and becomes non-responsive.  The admin has tried it a
>> couple of times and has seen different errors on 2 different datasets, so
>> right now it could also be that the dataset is corrupted, but so far the
>> failures are not deterministic.
>> 8) What have you attempted?  How did it change #6 and #7?
>> I've tried:
>> - adding the environment variables:
>> export MPI_BUFS_PER_PROC=64
>> export MPI_BUFS_PER_HOST=256
>> with no change in behavior; this sometimes still resulted in the
>> MPI_Finalize() error
>> - using my own installation of OpenMPI
>>     from yt.mods import *
>>   File "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/mods.py", line 44, in
>> <module>
>>     from yt.data_objects.api import \
>>   File "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/data_objects/api.py",
>> line 34, in <module>
>>     from hierarchy import \
>>   File
>> "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/data_objects/hierarchy.py",
>> line 40, in <module>
>>     from yt.utilities.parallel_tools.parallel_analysis_interface import \
>>   File
>> "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/utilities/parallel_tools/parallel_analysis_interface.py",
>> line 49, in <module>
>>     from mpi4py import MPI
>> ImportError:
>> /nics/b/home/gsiisg/NautilusYT/lib/python2.7/site-packages/mpi4py/MPI.so:
>> undefined symbol: mpi_sgi_inplace
>> The system admin says there are bugs or incompatibilities with the network
>> and that I should use SGI's MPI via the module mpt/2.04, which I was using
>> before trying my own installation of OpenMPI.
>> - currently modifying my script with Stephen's proposed changes; once it
>> runs on my laptop, I will let the sys admin try it on the smaller 800 cube
>> dataset before trying it on the 3200 dataset.  At least when his own job
>> hangs the machine, he can terminate it faster without waiting for someone
>> to answer his emails.  Hopefully these tests won't be too much of a
>> disruption to other Nautilus users.
>> - I spoke briefly with Brian Crosby about this during the Enzo meeting;
>> he said he's encountered MPI errors on Nautilus as well, but his issue
>> might be different from mine.  This may or may not be a YT issue after
>> all, but since multiple people seem interested in YT's performance on
>> Nautilus, I'll keep everyone updated with the latest developments.
>>
>> From
>> G.S.
>> On Tue, Oct 18, 2011 at 7:59 PM, Matthew Turk <matthewturk at gmail.com> wrote:
>>>
>>> Geoffrey,
>>>
>>> Parallel HOP definitely does not attempt to load all of the particles,
>>> simultaneously, on all processors.  This is covered in the method
>>> papers for both p-hop and yt, the documentation for yt, the source
>>> code, and I believe on the yt-users mailing list a couple times when
>>> discussing estimates for resource usage in p-hop.
>>>
>>> The struggles you have been having with Nautilus may in fact be a yt
>>> problem, an application-of-yt problem, a software problem on Nautilus,
>>> or even (if Nautilus is being exposed to an excessive number of cosmic
>>> rays, for instance) a hardware problem.  To properly debug exactly what
>>> is going on, it would probably be productive for you to provide us with
>>> the following:
>>>
>>> 1) What are you attempting to do, precisely?
>>> 2) What type of data, and what size of data, are you applying this to?
>>> 3) What is the version of yt you are using (changeset hash)?
>>> 4) How are you launching yt?
>>> 5) What is the memory available to each individual process?
>>> 6) Under what circumstances does yt crash?
>>> 7) How does yt report this crash to you, and is it deterministic?
>>> 8) What have you attempted?  How did it change #6 and #7?
>>>
>>> We're interested in ensuring that yt functions well on Nautilus, and
>>> that it is able to successfully halo find, analyze, etc.  However,
>>> right now it feels like we're being given about 10% of a bug report,
>>> and that is regrettably not enough to properly diagnose and repair the
>>> problem.
>>>
>>> Thanks,
>>>
>>> Matt
>>>
>>> On Tue, Oct 18, 2011 at 7:51 PM, Geoffrey So <gsiisg at gmail.com> wrote:
>>> > Ah yes, I think that answers our question.
>>> > We were worried that all the particles were read in by each processor
>>> > (which I told him I didn't think was the case, or it would have crashed
>>> > on my smaller 800 cube long ago), but I wanted to get the answer from
>>> > the pros.
>>> > Thanks!
>>> > From
>>> > G.S.
>>> >
>>> > On Tue, Oct 18, 2011 at 4:21 PM, Stephen Skory <s at skory.us> wrote:
>>> >>
>>> >> Geoffrey,
>>> >>
>>> >> > "Is the particle IO in YT that calls h5py spawned by multiple
>>> >> > processors
>>> >> > or is it doing it serially?"
>>> >>
>>> >> For your purposes, h5py is only used to *write* particle data to disk
>>> >> after the halos have been found (if you are saving them to disk, which
>>> >> you must do explicitly, of course). And in this case, it will open up
>>> >> one file using h5py per MPI task.
>>> >>
>>> >> I'm guessing that they're actually concerned about reading particle
>>> >> data, because that is more disk intensive. This is done with functions
>>> >> written in C that read the data, not h5py. Here each MPI task does its
>>> >> own reading of data, and may open up multiple files to retrieve the
>>> >> particle data it needs depending on the layouts of grids in the
>>> >> .cpuNNNN files.
>>> >>
>>> >> Does that help?
>>> >>
>>> >> --
>>> >> Stephen Skory
>>> >> s at skory.us
>>> >> http://stephenskory.com/
>>> >> 510.621.3687 (google voice)


