[Yt-dev] Projection speed improvement patch

Thu Nov 5 17:01:38 PST 2009

Hi John,

I ended up tracking down a bug in the newly refactored hierarchy (note
about that below), but I have now benchmarked projections on Triton.
Unfortunately, the network disk was kind of dying, so the benchmarks
are more IO dominated than I think they should be. I might give it
another go in a few days, if I notice the disk performing any better.

It takes 250 seconds to do the IO and roughly 400 seconds total on 32
processors.  This is projecting all the way to L10, three fields (one
of which is a derived field, composite from three, so a net of six
fields are being read.)  This means about 150 seconds for the
instantiation (which is now negligible) and the math.  Some processors
even sat in the Alltoallv call for ~300 seconds (the upper left patch,
for instance, which only goes to L6) so I believe I can now assert
it's completely IO dominated.  (Note on that below.)
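
The accounting above can be sketched in a few lines of Python. The
totals are the figures quoted in this message; the per-rank arrival
times are hypothetical values chosen only to illustrate why a rank that
finishes early can sit in a collective for ~300 seconds, and
collective_wait is an illustrative helper, not part of yt. It assumes
that, in practice, no rank leaves the exchange before the slowest rank
has contributed its data.

```python
# Figures quoted in the message (seconds, 32 processors).
total_s = 400.0            # approximate wall-clock time for the projection
io_s = 250.0               # time attributed to IO
other_s = total_s - io_s   # what is left for instantiation plus math
io_fraction = io_s / total_s


def collective_wait(arrival_times):
    """Idle time each rank spends waiting inside a collective exchange.

    Simplified model: every rank waits until the latest rank arrives,
    so its wait is (latest arrival - its own arrival).
    """
    latest = max(arrival_times)
    return [latest - t for t in arrival_times]


# Hypothetical arrival times, NOT measured data: rank 0 owns a shallow
# patch and finishes its reads early; the others are still reading.
arrivals = [100.0, 380.0, 400.0, 395.0]
waits = collective_wait(arrivals)

print("non-IO time (s):", other_s)
print("IO fraction:", io_fraction)
print("wait per rank (s):", waits)
```

In this model, a long wait on a rank with little data says more about
the slow readers than about the exchange itself.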

Looking at the memory usage after each level was done, the
get_memory_usage() function -- which opens /proc/pid/shmem -- reports
that the maximum memory usage per task is 1.6 GB.  The final L10
projection of the three fields + weight + position information takes
572 MB of space.  Creating a 4098^2 (one pixel on either side for the
border!) image takes about 5 seconds.  With the plotting refactor, I
anticipate this coming down, because right now it calls the
Pixelization routine too many times.  (The pixelization routine takes
*less* time than the write-out-png routine!)
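
For readers who want to take a similar per-task measurement, here is a
minimal sketch of reading the resident memory of a process from /proc
on Linux. It is an assumption about how such a helper could be written,
not yt's get_memory_usage(): it reads /proc/<pid>/statm, which may not
be the file the real function opens, and resident_memory_mb is a
hypothetical name.

```python
import os


def resident_memory_mb(pid="self"):
    """Return the resident set size in MiB, or None if /proc is unavailable.

    Assumes the Linux /proc/<pid>/statm layout, where the second field is
    the number of resident pages.
    """
    path = "/proc/%s/statm" % pid
    try:
        with open(path) as f:
            fields = f.read().split()
    except OSError:
        # Not on Linux, or no such process.
        return None
    resident_pages = int(fields[1])
    page_size = os.sysconf("SC_PAGE_SIZE")  # bytes per page
    return resident_pages * page_size / (1024.0 ** 2)


usage = resident_memory_mb()
if usage is None:
    print("memory usage not available on this platform")
else:
    print("resident memory: %.1f MiB" % usage)
```

Calling such a helper after each level finishes, on every task, would
give the kind of per-level peak figure reported above.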

Keep in mind that the yt image making process, once you have
projected, is essentially free -- so once you do your projection (400
seconds on 32 processors with suboptimal disk) you can make infinite
images for free on, for instance, your laptop.

The two things I'm still not completely set on:

 * The main bug I tracked down was that the grids were all off with
the new hierarchy, but not the old.  If I changed from calculating dx
for all grids by (RE - LE)/dims to setting dx by Parent.dds /
refine_factor, it worked just fine.  But looking at the values, I
don't see why this changed anything unless it also messed up the
integer coordinates or the inclusion of grids in regions.  Anyway,
that now reproduces all the old behavior.
 * There are too many calls to the IO routines; it only reads from
each file once, but for some reason it's calling the
"ReadMultipleGrids" routine more times than there are CPU files, which
means something is wrong.  I'll possibly dig into this at a later
date.
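
The two ways of computing dx mentioned in the first item can be
compared directly. The sketch below uses a hypothetical grid (not one
from this dataset) and illustrative function names. It shows that the
two formulas agree mathematically but need not agree bit for bit in
floating point, which is one possible contributor to the behavior
described above; it is offered as a guess, not a diagnosis.

```python
import math


def dx_from_edges(left_edge, right_edge, dims):
    """Cell width per axis from the grid's own extent: (RE - LE) / dims."""
    return [(hi - lo) / n for lo, hi, n in zip(left_edge, right_edge, dims)]


def dx_from_parent(parent_dds, refine_factor):
    """Cell width per axis inherited from the parent: Parent.dds / refine_factor."""
    return [d / refine_factor for d in parent_dds]


# Hypothetical example: a parent with dds = 0.1 on each axis, and a child
# refined by a factor of 2 that spans [0.3, 0.7] with 8 cells per axis.
parent_dds = [0.1, 0.1, 0.1]
refine_factor = 2
left_edge = [0.3, 0.3, 0.3]
right_edge = [0.7, 0.7, 0.7]
dims = [8, 8, 8]

from_edges = dx_from_edges(left_edge, right_edge, dims)
from_parent = dx_from_parent(parent_dds, refine_factor)

for a, b in zip(from_edges, from_parent):
    # Print full precision, exact equality, and equality within tolerance.
    print(repr(a), repr(b), a == b, math.isclose(a, b, rel_tol=1e-12))
```

If anything downstream converts positions to integer coordinates by
dividing by dx and truncating, a difference in the last bits could in
principle move a grid by one cell.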

Anyway, I'm pretty happy with this.  :)

-Matt

On Tue, Nov 3, 2009 at 7:07 PM, Matthew Turk <matthewturk at gmail.com> wrote:
> Wow, John -- that's simply unacceptable performance on yt's part, both
> from a memory and a timing standpoint.
>
> I'd love to take a look at this data, so if you want to toss it my
> way, please do so!
>
> On Tue, Nov 3, 2009 at 7:02 PM, John Wise <jwise at astro.princeton.edu> wrote:
>> Hi Matt,
>>
>> That is great news!  About two months ago, I tried doing full projections on
>> a 768^3 AMR everywhere (10 levels) on ranger.  But I've had problems running
>> out of memory (never really bothered to check memory usage because you can't
>> have interactive jobs ... to my knowledge).  I was running with 256 cores
>> (512GB RAM should be enough...).  The I/O was taking forever, also.  I ended
>> up just doing projections of subvolumes.
>>
>> But I'll be sure to test your improved version (along with the .harrays
>> file) and report back to the list!
>>
>> Thanks!
>> John
>>
>>
>> On 3 Nov 2009, at 15:43, Matthew Turk wrote:
>>
>>> Hi guys,
>>>
>>> I just wanted to report one more benchmark which might be interesting
>>> for a couple of you.  I ran the same test, with *one additional
>>> projected field* on Triton, and it takes 3:45 to project the entire
>>> 512^3 L7 amr-everywhere dataset to the finest resolution, projecting
>>> Density, Temperature, VelocityMagnitude (requires 3 fields) and
>>> Gravitational_Potential.  This is over ethernet, rather than shared
>>> memory (I did not use the myrinet interconnect for this test) and it's
>>> with an additional field -- so pretty good, I think.
>>>
>>> There are some issues with processors lagging, but I think they are
>>> not a big deal anymore!
>>>
>>> -Matt
>>>
>>> On Mon, Nov 2, 2009 at 10:25 PM, Matthew Turk <matthewturk at gmail.com>
>>> wrote:
>>>>
>>>> Hi Sam,
>>>>
>>>> I guess you're right, without the info about the machine, this doesn't
>>>> help much!
>>>>
>>>> This was running on a new machine at SLAC called 'orange-bigmem' --
>>>> it's a 32-node machine with a ton of memory available to all the
>>>> processors.  I checked memory usage at the end of the run, and after
>>>> the projection had been saved out a few times it was around 1.5 gigs
>>>> per node.  I'm threading some outputs of the total memory usage
>>>> through the projection code, and hopefully that will give us an idea
>>>> of the peak memory usage.
>>>>
>>>> The file system is lustre, which works well with the preloading of the
>>>> data, and I ran it a couple times beforehand to make sure that the
>>>> files were in local cache or whatever.
>>>>
>>>> So the communication was via shared memory, which while still an MPI
>>>> interface is much closer to ideal.  I will be giving it a go on a
>>>> cluster tomorrow, after I work out some kinks with data storage.  I've
>>>> moved the generation of the binary hierarchies into yt -- so if you
>>>> don't have one, rather than dumping the hierarchy into the .yt file,
>>>> it will dump it into the .harrays file.  This way if anyone else
>>>> writes an interface for the binary hierarchy method, we can all share
>>>> it.  (I think it would be a bad idea to have Enzo output a .yt file.
>>>> ;-)  The .yt file will now exist solely to store objects, not any of
>>>> the hierarchy info.
>>>>
>>>> -Matt
>>>>
>>>> On Mon, Nov 2, 2009 at 10:13 PM, Sam Skillman <72Nova at gmail.com> wrote:
>>>>>
>>>>> Hi Matt,
>>>>> This is awesome.  I don't think anyone can expect much faster for that
>>>>> dataset.  I remember running projections just a year or so ago on this
>>>>> data and it taking a whole lot more time (just reading in the data took
>>>>> ages).  What machine were you able to do this on?  I'm mostly curious
>>>>> about the memory it used, or had available to it.
>>>>> In any case, I'd say this is a pretty big success, and the binary
>>>>> hierarchies are a great idea.
>>>>> Cheers,
>>>>> Sam
>>>>> On Mon, Nov 2, 2009 at 8:47 PM, Matthew Turk <matthewturk at gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> (For all of these performance indicators, I've used the 512^3 L7
>>>>>> amr-everywhere run called the "LightCone."  This particular dataset
>>>>>> has ~380,000 grids and is a great place to find the )
>>>>>>
>>>>>> Last weekend I did a little bit of benchmarking and saw that the
>>>>>> parallel projections (and likely several other parallel operations)
>>>>>> all sat inside an MPI_Barrier for far too long.  I converted (I
>>>>>> think!) this process to be an MPI_Alltoallv operation, following on an
>>>>>> MPI_Allreduce to get the final array size and the offsets into an
>>>>>> ordered array, and I think it is working.  I saw pretty good
>>>>>> performance improvements, but it's tough to quantify those right now
>>>>>> -- for projecting "Ones" (no disk-access) it sped things up by ~15%.
>>>>>>
>>>>>> I've also added a new binary hierarchy method to devel enzo, and it
>>>>>> provides everything that is necessary for yt to analyze the data.  As
>>>>>> such, if a %(basename)s.harrays file exists, it will be used, and yt
>>>>>> will not need to open the .hierarchy file at all.  This sped things up
>>>>>> by 100 seconds.  I've written a script to create these
>>>>>> (http://www.slac.stanford.edu/~mturk/create_harrays.py), but
>>>>>> outputting them inline in Enzo is the fastest.
>>>>>>
>>>>>> To top this all off, I ran a projection -- start to finish, including
>>>>>> all overhead -- on 16 processors.  To project the fields "Density"
>>>>>> (native), "Temperature" (native) and "VelocityMagnitude" (derived,
>>>>>> requires x-, y- and z-velocity) on 16 processors to the finest
>>>>>> resolution (adaptive projection -- to L7) takes 140 seconds, or
>>>>>> roughly 2:20.
>>>>>>
>>>>>> I've looked at the profiling outputs, and it seems to me that there
>>>>>> are still some places performance could be squeezed out.  That being
>>>>>> said, I'm pretty pleased with these results.
>>>>>>
>>>>>> These are all in the named branch hierarchy-opt in mercurial.  They
>>>>>> rely on some rearrangement of the hierarchy parsing and whatnot that
>>>>>> has lived in hg for a little while; it will go into the trunk as soon
>>>>>> as I get the all clear about moving to a proper stable/less-stable dev
>>>>>> environment.  I also have some other test suites to run on them, and I
>>>>>> want to make sure the memory usage is not excessive.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Matt
>>>>>> _______________________________________________
>>>>>> Yt-dev mailing list
>>>>>> Yt-dev at lists.spacepope.org
>>>>>> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Samuel W. Skillman
>>>>> DOE Computational Science Graduate Fellow
>>>>> Center for Astrophysics and Space Astronomy
>>>>> University of Colorado at Boulder
>>>>> samuel.skillman[at]colorado.edu
>>>>>
>>>>
>>
>>
>