[Yt-dev] Projection speed improvement patch

Tue Nov 3 11:43:45 PST 2009

Hi guys,

I just wanted to report one more benchmark which might be interesting
for a couple of you.  I ran the same test, with *one additional
projected field* on Triton, and it takes 3:45 to project the entire
512^3 L7 amr-everywhere dataset to the finest resolution, projecting
Density, Temperature, VelocityMagnitude (requires 3 fields) and
Gravitational_Potential.  This is over ethernet, rather than shared
memory (I did not use the Myrinet interconnect for this test) and it's
with an additional field -- so pretty good, I think.

There are some issues with processors lagging, but I think they are
not a big deal anymore!

-Matt

On Mon, Nov 2, 2009 at 10:25 PM, Matthew Turk <matthewturk at gmail.com> wrote:
> Hi Sam,
>
> I guess you're right, without the info about the machine, this doesn't
> help much!
>
> This was running on a new machine at SLAC called 'orange-bigmem' --
> it's a 32-node machine with a ton of memory available to all the
> processors.  I checked memory usage at the end of the run, and after
> the projection had been saved out a few times it was around 1.5 gigs
> per node.  I'm threading some outputs of the total memory usage
> through the projection code, and hopefully that will give us an idea
> of the peak memory usage.
>
> The file system is Lustre, which works well with the preloading of the
> data, and I ran it a couple times beforehand to make sure that the
> files were in local cache or whatever.
>
> So the communication was via shared memory, which, while still an MPI
> interface, is much closer to ideal.  I will be giving it a go on a
> cluster tomorrow, after I work out some kinks with data storage.  I've
> moved the generation of the binary hierarchies into yt -- so if you
> don't have one, rather than dumping the hierarchy into the .yt file,
> it will dump it into the .harrays file.  This way if anyone else
> writes an interface for the binary hierarchy method, we can all share
> it.  (I think it would be a bad idea to have Enzo output a .yt file.
> ;-)  The .yt file will now exist solely to store objects, not any of
> the hierarchy info.
>
> -Matt
>
> On Mon, Nov 2, 2009 at 10:13 PM, Sam Skillman <72Nova at gmail.com> wrote:
>> Hi Matt,
>> This is awesome.  I don't think anyone can expect much faster for that
>> dataset.  I remember running projections just a year or so ago on this data
>> and it taking a whole lot more time (just reading in the data took ages).
>>  What machine were you able to do this on?  I'm mostly curious about the
>> memory it used, or had available to it.
>> In any case, I'd say this is a pretty big success, and the binary
>> hierarchies are a great idea.
>> Cheers,
>> Sam
>> On Mon, Nov 2, 2009 at 8:47 PM, Matthew Turk <matthewturk at gmail.com> wrote:
>>>
>>> Hi guys,
>>>
>>> (For all of these performance indicators, I've used the 512^3 L7
>>> amr-everywhere run called the "LightCone."  This particular dataset
>>> has ~380,000 grids and is a great place to find the )
>>>
>>> Last weekend I did a little bit of benchmarking and saw that the
>>> parallel projections (and likely several other parallel operations)
>>> all sat inside an MPI_Barrier for far too long.  I converted (I
>>> think!) this process to be an MPI_Alltoallv operation, following on an
>>> MPI_Allreduce to get the final array size and the offsets into an
>>> ordered array, and I think it is working.  I saw pretty good
>>> performance improvements, but it's tough to quantify those right now
>>> -- for projecting "Ones" (no disk-access) it sped things up by ~15%.
>>>
>>> I've also added a new binary hierarchy method to devel enzo, and it
>>> provides everything that is necessary for yt to analyze the data.  As
>>> such, if a %(basename)s.harrays file exists, it will be used, and yt
>>> will not need to open the .hierarchy file at all.  This sped things up
>>> by 100 seconds.  I've written a script to create these
>>> (http://www.slac.stanford.edu/~mturk/create_harrays.py), but
>>> outputting them inline in Enzo is the fastest.
>>>
>>> To top this all off, I ran a projection -- start to finish, including
>>> all overhead -- on 16 processors.  To project the fields "Density"
>>> (native), "Temperature" (native) and "VelocityMagnitude" (derived,
>>> requires x-, y- and z-velocity) on 16 processors to the finest
>>> resolution (adaptive projection -- to L7) takes 140 seconds, or
>>> roughly 2:20.
>>>
>>> I've looked at the profiling outputs, and it seems to me that there
>>> are still some places performance could be squeezed out.  That being
>>> said, I'm pretty pleased with these results.
>>>
>>> These are all in the named branch hierarchy-opt in mercurial.  They
>>> rely on some rearrangement of the hierarchy parsing and whatnot that
>>> has lived in hg for a little while; it will go into the trunk as soon
>>> as I get the all clear about moving to a proper stable/less-stable dev
>>> environment.  I also have some other test suites to run on them, and I
>>> want to make sure the memory usage is not excessive.
>>>
>>> Best,
>>>
>>> Matt
>>> _______________________________________________
>>> Yt-dev mailing list
>>> Yt-dev at lists.spacepope.org
>>> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
>>
>>
>>
>> --
>> Samuel W. Skillman
>> DOE Computational Science Graduate Fellow
>> Center for Astrophysics and Space Astronomy
>> University of Colorado at Boulder
>> samuel.skillman[at]colorado.edu
>>
>
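
The oldest message in the thread describes replacing a long wait in
MPI_Barrier with an MPI_Allreduce (to get the final array size and the
offsets into an ordered array) followed by an MPI_Alltoallv.  The
bookkeeping behind that can be sketched without MPI at all.  What
follows is a minimal single-process illustration, not the code in the
hierarchy-opt branch: the function names gather_layout and assemble are
hypothetical, and the "ranks" are simulated as plain Python lists.  It
assumes each rank contributes a variable-length chunk and that the
chunks are placed in rank order.

```python
def gather_layout(counts):
    """Given the number of elements each rank holds, return the total
    size of the combined array and each rank's starting offset in it.

    In a real MPI run the per-rank counts would first have to be made
    known to every rank (for example via a reduction); here they are
    simply passed in as a list. This is an assumption for illustration.
    """
    offsets = []
    running = 0
    for n in counts:
        offsets.append(running)  # exclusive prefix sum: where this rank starts
        running += n
    return running, offsets


def assemble(chunks):
    """Place variable-length per-rank chunks into one rank-ordered list,
    mimicking the displacements a variable-count collective would use."""
    counts = [len(c) for c in chunks]
    total, offsets = gather_layout(counts)
    ordered = [None] * total
    for chunk, start in zip(chunks, offsets):
        ordered[start:start + len(chunk)] = chunk
    return ordered


if __name__ == "__main__":
    # Three simulated ranks; the middle one has nothing to contribute.
    total, offsets = gather_layout([3, 0, 2])
    print(total, offsets)
    print(assemble([[1, 2, 3], [], [4, 5]]))
```

Once every rank knows the total size and its own offset, each one can
write its piece directly into the right slot of the final array, so no
rank has to sit idle waiting for the others to finish in sequence.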