[Yt-dev] Projection speed improvement patch

Fri Nov 6 04:29:03 PST 2009

Hi Matt,

Thanks so much for taking a look at my data and examining the memory  
usage of analyzing a dataset of this size.  I'll have to give it  
another shot on ranger.  I can also see how I/O performance is on the  
Altix here at Princeton, which has a local RAID (just like red).

You said that I could do projections on my laptop once the computation  
is done on a large machine.  I know the projection structure is stored  
in the .yt file, but are the projected fields also stored in the .yt  
file?  Or do I have to have the data on my laptop?

Thanks again!
John

On 5 Nov 2009, at 20:01, Matthew Turk wrote:

> Hi John,
>
> I ended up tracking down a bug in the newly refactored hierarchy (note
> about that below) but I benchmarked projections of the code on Triton.
> Unfortunately, the network disk was kind of dying, so the benchmarks
> are more IO dominated than I think they should be -- I might give it a
> go in a few days, if I notice the disk performing any better.
>
> It takes 250 seconds to do the IO and roughly 400 seconds total on 32
> processors.  This is projecting all the way to L10, three fields (one
> of which is a derived field, composite from three, so a net of six
> fields are being read.)  This means about 150 seconds for the
> instantiation (which is now negligible) and the math.  Some processors
> even sat in the Alltoallv call for ~300 seconds (the upper left patch,
> for instance, which only goes to L6) so I believe I can now assert
> it's completely IO dominated.  (Note on that below.)
>
> Looking at the memory usage after each level was done, the
> get_memory_usage() function -- which opens /proc/pid/shmem -- reports
> that the maximum memory usage per task is 1.6 Gigs.  The final L10
> projection of the three fields + weight + position information takes
> 572Mb of space.  Creating a 4098^2 (one pixel on either side for the
> border!) image takes about 5 seconds.  With the plotting refactor, I
> anticipate this coming down, because right now it calls the
> Pixelization routine too many times.  (The pixelization routine takes
> *less* time than the write-out-png routine!)
>
> Keep in mind that the yt image making process, once you have
> projected, is essentially free -- so once you do your projection (400
> seconds on 32 processors with suboptimal disk) you can make infinite
> images for free on, for instance, your laptop.
>
> The two things I'm still not completely set on:
>
> * The main bug I tracked down was that the grids were all off with
> the new hierarchy, but not the old.  If I changed from calculating dx
> for all grids by (RE - LE)/dims to setting dx by Parent.dds /
> refine_factor, it worked just fine.  But looking at the values, I
> don't see why this changed anything unless it also messed up the
> integer coordinates or the inclusion of grids in regions.  Anyway,
> that now reproduces all the old behavior.
> * There are too many calls to the IO routines; it only reads from
> each file once, but for some reason it's calling the
> "ReadMultipleGrids" routine more than the number of CPU files, which
> means something is wrong.  I'll possibly dig into this at a later
> date.
>
> Anyway, I'm pretty happy with this.  :)
>
> -Matt
>
> On Tue, Nov 3, 2009 at 7:07 PM, Matthew Turk <matthewturk at gmail.com>  
> wrote:
>> Wow, John -- that's simply unacceptable performance on yt's part,  
>> both
>> from a memory and a timing standpoint.
>>
>> I'd love to take a look at this data, so if you want to toss it my
>> way, please do so!
>>
>> On Tue, Nov 3, 2009 at 7:02 PM, John Wise  
>> <jwise at astro.princeton.edu> wrote:
>>> Hi Matt,
>>>
>>> That is great news!  About two months ago, I tried doing full  
>>> projections on
>>> a 768^3 AMR everywhere (10 levels) on ranger.  But I've had  
>>> problems running
>>> out of memory (never really bothered to check memory usage because  
>>> you can't
>>> have interactive jobs ... to my knowledge).  I was running with  
>>> 256 cores
>>> (512GB RAM should be enough...).  The I/O was taking forever,  
>>> also.  I ended
>>> up just doing projections of subvolumes.
>>>
>>> But I'll be sure to test your improved version (along with  
>>> the .harrays
>>> file) and report back to the list!
>>>
>>> Thanks!
>>> John
>>>
>>>
>>> On 3 Nov 2009, at 15:43, Matthew Turk wrote:
>>>
>>>> Hi guys,
>>>>
>>>> I just wanted to report one more benchmark which might be  
>>>> interesting
>>>> for a couple of you.  I ran the same test, with *one additional
>>>> projected field* on Triton, and it takes 3:45 to project the entire
>>>> 512^3 L7 amr-everywhere dataset to the finest resolution,  
>>>> projecting
>>>> Density, Temperature, VelocityMagnitude (requires 3 fields) and
>>>> Gravitational_Potential.  This is over ethernet, rather than shared
>>>> memory (I did not use the myrinet interconnect for this test) and  
>>>> it's
>>>> with an additional field -- so pretty good, I think.
>>>>
>>>> There are some issues with processors lagging, but I think they are
>>>> not a big deal anymore!
>>>>
>>>> -Matt
>>>>
>>>> On Mon, Nov 2, 2009 at 10:25 PM, Matthew Turk <matthewturk at gmail.com 
>>>> >
>>>> wrote:
>>>>>
>>>>> Hi Sam,
>>>>>
>>>>> I guess you're right, without the info about the machine, this  
>>>>> doesn't
>>>>> help much!
>>>>>
>>>>> This was running on a new machine at SLAC called 'orange-bigmem'  
>>>>> --
>>>>> it's a 32node machine with a ton of memory available to all the
>>>>> processors.  I checked memory usage at the end of the run, and  
>>>>> after
>>>>> the projection ahd been save out a few times it was around 1.5  
>>>>> gigs
>>>>> per node.  I'm threading some outputs of the total memory usage
>>>>> through the projection code, and hopefully that will give us an  
>>>>> idea
>>>>> of the peak memory usage.
>>>>>
>>>>> The file system is lustre, which works well with the preloading  
>>>>> of the
>>>>> data, and I ran it a couple times beforehand to make sure that the
>>>>> files were in local cache or whatever.
>>>>>
>>>>> So the communication was via shared memory, which while still an  
>>>>> MPI
>>>>> interface is much closer to ideal.  I will be giving it a go on a
>>>>> cluster tomorrow, after I work out some kinks with data  
>>>>> storage.  I've
>>>>> moved the generation of the binary hierarchies into yt -- so if  
>>>>> you
>>>>> don't have one, rather than dumping the hierarchy into the .yt  
>>>>> file,
>>>>> it will dump it into the .harrays file.  This way if anyone else
>>>>> writes an interface for the binary hierarchy method, we can all  
>>>>> share
>>>>> it.  (I think it would be a bad idea to have Enzo output a .yt  
>>>>> file.
>>>>> ;-)  The .yt file will now exist solely to store objects, not  
>>>>> any of
>>>>> the hierarchy info.
>>>>>
>>>>> -Matt
>>>>>
>>>>> On Mon, Nov 2, 2009 at 10:13 PM, Sam Skillman <72Nova at gmail.com>  
>>>>> wrote:
>>>>>>
>>>>>> Hi Matt,
>>>>>> This is awesome.  I don't think anyone can expect much faster  
>>>>>> for that
>>>>>> dataset.  I remember running projections just a year or so ago  
>>>>>> on this
>>>>>> data
>>>>>> and it taking a whole lot more time (just reading in the data  
>>>>>> took
>>>>>> ages).
>>>>>>  What machine were you able to do this on?  I'm mostly curious  
>>>>>> about the
>>>>>> memory it used, or had available to it.
>>>>>> In any case, I'd say this is a pretty big success, and the binary
>>>>>> hierarchies are a great idea.
>>>>>> Cheers,
>>>>>> Sam
>>>>>> On Mon, Nov 2, 2009 at 8:47 PM, Matthew Turk <matthewturk at gmail.com 
>>>>>> >
>>>>>> wrote:
>>>>>>>
>>>>>>> Hi guys,
>>>>>>>
>>>>>>> (For all of these performance indicators, I've used the 512^3 L7
>>>>>>> amr-everywhere run called the "LightCone."  This particular  
>>>>>>> dataset
>>>>>>> has ~380,000 grids and is a great place to find the )
>>>>>>>
>>>>>>> Last weekend I did a little bit of benchmarking and saw that the
>>>>>>> parallel projections (and likely several other parallel  
>>>>>>> operations)
>>>>>>> all sat inside an MPI_Barrier for far too long.  I converted (I
>>>>>>> think!) this process to be an MPI_Alltoallv operation,  
>>>>>>> following on an
>>>>>>> MPI_Allreduce to get the final array size and the offsets into  
>>>>>>> an
>>>>>>> ordered array, and I think it is working.  I saw pretty good
>>>>>>> performance improvements, but it's tough to quantify those  
>>>>>>> right now
>>>>>>> -- for projecting "Ones" (no disk-access) it sped things up by  
>>>>>>> ~15%.
>>>>>>>
>>>>>>> I've also added a new binary hierarchy method to devel enzo,  
>>>>>>> and it
>>>>>>> provides everything that is necessary for yt to analyze the  
>>>>>>> data.  As
>>>>>>> such, if a %(basename)s.harrays file exists, it will be used,  
>>>>>>> and yt
>>>>>>> will not need to open the .hierarchy file at all.  This sped  
>>>>>>> things up
>>>>>>> by 100 seconds.  I've written a script to create these
>>>>>>> (http://www.slac.stanford.edu/~mturk/create_harrays.py), but
>>>>>>> outputting them inline in Enzo is the fastest.
>>>>>>>
>>>>>>> To top this all off, I ran a projection -- start to finish,  
>>>>>>> including
>>>>>>> all overhead -- on 16 processors.  To project the fields  
>>>>>>> "Density"
>>>>>>> (native), "Temperature" (native) and  
>>>>>>> "VelocityMagnitude" (derived,
>>>>>>> requires x-, y- and z-velocity) on 16 processors to the finest
>>>>>>> resolution (adaptive projection -- to L7) takes 140 seconds, or
>>>>>>> roughly 2:20.
>>>>>>>
>>>>>>> I've looked at the profiling outputs, and it seems to me that  
>>>>>>> there
>>>>>>> are still some places performance could be squeezed out.  That  
>>>>>>> being
>>>>>>> said, I'm pretty pleased with these results.
>>>>>>>
>>>>>>> These are all in the named branch hierarchy-opt in mercurial.   
>>>>>>> They
>>>>>>> rely on some rearrangement of the hierarchy parsing and  
>>>>>>> whatnot that
>>>>>>> has lived in hg for a little while; it will go into the trunk  
>>>>>>> as soon
>>>>>>> as I get the all clear about moving to a proper stable/less- 
>>>>>>> stable dev
>>>>>>> environment.  I also have some other test suites to run on  
>>>>>>> them, and I
>>>>>>> want to make sure the memory usage is not excessive.
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Matt
>>>>>>> _______________________________________________
>>>>>>> Yt-dev mailing list
>>>>>>> Yt-dev at lists.spacepope.org
>>>>>>> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Samuel W. Skillman
>>>>>> DOE Computational Science Graduate Fellow
>>>>>> Center for Astrophysics and Space Astronomy
>>>>>> University of Colorado at Boulder
>>>>>> samuel.skillman[at]colorado.edu
>>>>>>
>>>>>> _______________________________________________
>>>>>> Yt-dev mailing list
>>>>>> Yt-dev at lists.spacepope.org
>>>>>> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
>>>>>>
>>>>>>
>>>>>
>>>> _______________________________________________
>>>> Yt-dev mailing list
>>>> Yt-dev at lists.spacepope.org
>>>> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
>>>
>>> _______________________________________________
>>> Yt-dev mailing list
>>> Yt-dev at lists.spacepope.org
>>> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
>>>
>>
> _______________________________________________
> Yt-dev mailing list
> Yt-dev at lists.spacepope.org
> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org