[Yt-dev] Projection speed improvement patch
John Wise
jwise at astro.princeton.edu
Fri Nov 6 04:29:03 PST 2009
Hi Matt,
Thanks so much for taking a look at my data and examining the memory
usage of analyzing a dataset of this size. I'll have to give it
another shot on ranger. I can also see how I/O performance is on the
Altix here at Princeton, which has a local RAID (just like red).
You said that I could do projections on my laptop once the computation
is done on a large machine. I know the projection structure is stored
in the .yt file, but are the projected fields also stored in the .yt
file? Or do I have to have the data on my laptop?
Thanks again!
John
On 5 Nov 2009, at 20:01, Matthew Turk wrote:
> Hi John,
>
> I ended up tracking down a bug in the newly refactored hierarchy (note
> about that below) but I benchmarked projections of the code on Triton.
> Unfortunately, the network disk was kind of dying, so the benchmarks
> are more IO dominated than I think they should be -- I might give it a
> go in a few days, if I notice the disk performing any better.
>
> It takes 250 seconds to do the IO and roughly 400 seconds total on 32
> processors. This is projecting all the way to L10, three fields (one
> of which is a derived field composed from three native fields, so a
> net of six fields is being read). This means about 150 seconds for the
> instantiation (which is now negligible) and the math. Some processors
> even sat in the Alltoallv call for ~300 seconds (the upper left patch,
> for instance, which only goes to L6) so I believe I can now assert
> it's completely IO dominated. (Note on that below.)
>
> Looking at the memory usage after each level was done, the
> get_memory_usage() function -- which opens /proc/pid/shmem -- reports
> that the maximum memory usage per task is 1.6 Gigs. The final L10
> projection of the three fields + weight + position information takes
> 572Mb of space. Creating a 4098^2 (one pixel on either side for the
> border!) image takes about 5 seconds. With the plotting refactor, I
> anticipate this coming down, because right now it calls the
> Pixelization routine too many times. (The pixelization routine takes
> *less* time than the write-out-png routine!)
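> As a rough sketch of that kind of per-task bookkeeping, here is a
> get_memory_usage() in the same spirit. Note the body is illustrative,
> not yt's actual implementation -- reading VmRSS from /proc/<pid>/status
> is my assumption about the mechanism:

```python
import os

def get_memory_usage():
    """Return this process's resident memory in kilobytes.

    Reads /proc/<pid>/status (Linux-only).  This mirrors the idea of
    the helper mentioned above; the exact file and field used by yt's
    version may differ.
    """
    path = "/proc/%d/status" % os.getpid()
    if not os.path.exists(path):
        return 0  # non-Linux platforms lack /proc
    with open(path) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # value reported in kB
    return 0
```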
>
> Keep in mind that the yt image making process, once you have
> projected, is essentially free -- so once you do your projection (400
> seconds on 32 processors with suboptimal disk) you can make infinite
> images for free on, for instance, your laptop.
>
> The two things I'm still not completely set on:
>
> * The main bug I tracked down was that the grids were all off with
> the new hierarchy, but not the old. If I changed from calculating dx
> for all grids by (RE - LE)/dims to setting dx by Parent.dds /
> refine_factor, it worked just fine. But looking at the values, I
> don't see why this changed anything unless it also messed up the
> integer coordinates or the inclusion of grids in regions. Anyway,
> that now reproduces all the old behavior.
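> To make the two dx conventions concrete, here is a minimal sketch
> (the values are made up, not taken from this dataset; the point is
> that the two formulas can in general disagree by round-off, which
> matters once dx feeds into integer coordinates or region inclusion):

```python
import numpy as np

# Illustrative values only -- not taken from the dataset in the thread.
refine_factor = 2
parent_dds = np.float64(1.0 / 64.0)   # parent grid's cell width

# Method 1: compute dx from the grid's own edges and dimensions,
# dx = (RE - LE) / dims
dims = 16
left_edge = np.float64(0.21875)
right_edge = left_edge + dims * (parent_dds / refine_factor)
dx_from_edges = (right_edge - left_edge) / dims

# Method 2: inherit dx from the parent, dx = Parent.dds / refine_factor
dx_from_parent = parent_dds / refine_factor

# With arbitrary (non power-of-two) edges the two results can disagree
# by a few ULPs, which matters when dx is floored into integer grid
# coordinates or used for region-inclusion tests.
print(dx_from_edges, dx_from_parent)
```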
> * There are too many calls to the IO routines; it only reads from
> each file once, but for some reason it's calling the
> "ReadMultipleGrids" routine more than the number of CPU files, which
> means something is wrong. I'll possibly dig into this at a later
> date.
>
> Anyway, I'm pretty happy with this. :)
>
> -Matt
>
> On Tue, Nov 3, 2009 at 7:07 PM, Matthew Turk <matthewturk at gmail.com>
> wrote:
>> Wow, John -- that's simply unacceptable performance on yt's part,
>> both
>> from a memory and a timing standpoint.
>>
>> I'd love to take a look at this data, so if you want to toss it my
>> way, please do so!
>>
>> On Tue, Nov 3, 2009 at 7:02 PM, John Wise
>> <jwise at astro.princeton.edu> wrote:
>>> Hi Matt,
>>>
>>> That is great news! About two months ago, I tried doing full
>>> projections on
>>> a 768^3 AMR everywhere (10 levels) on ranger. But I've had
>>> problems running
>>> out of memory (never really bothered to check memory usage because
>>> you can't
>>> have interactive jobs ... to my knowledge). I was running with
>>> 256 cores
>>> (512GB RAM should be enough...). The I/O was taking forever,
>>> also. I ended
>>> up just doing projections of subvolumes.
>>>
>>> But I'll be sure to test your improved version (along with
>>> the .harrays
>>> file) and report back to the list!
>>>
>>> Thanks!
>>> John
>>>
>>>
>>> On 3 Nov 2009, at 15:43, Matthew Turk wrote:
>>>
>>>> Hi guys,
>>>>
>>>> I just wanted to report one more benchmark which might be
>>>> interesting
>>>> for a couple of you. I ran the same test, with *one additional
>>>> projected field* on Triton, and it takes 3:45 to project the entire
>>>> 512^3 L7 amr-everywhere dataset to the finest resolution,
>>>> projecting
>>>> Density, Temperature, VelocityMagnitude (requires 3 fields) and
>>>> Gravitational_Potential. This is over ethernet, rather than shared
>>>> memory (I did not use the myrinet interconnect for this test) and
>>>> it's
>>>> with an additional field -- so pretty good, I think.
>>>>
>>>> There are some issues with processors lagging, but I think they are
>>>> not a big deal anymore!
>>>>
>>>> -Matt
>>>>
>>>> On Mon, Nov 2, 2009 at 10:25 PM, Matthew Turk <matthewturk at gmail.com> wrote:
>>>>>
>>>>> Hi Sam,
>>>>>
>>>>> I guess you're right, without the info about the machine, this
>>>>> doesn't
>>>>> help much!
>>>>>
>>>>> This was running on a new machine at SLAC called 'orange-bigmem' --
>>>>> it's a 32-node machine with a ton of memory available to all the
>>>>> processors. I checked memory usage at the end of the run, and after
>>>>> the projection had been saved out a few times it was around 1.5 gigs
>>>>> per node. I'm threading some outputs of the total memory usage
>>>>> through the projection code, and hopefully that will give us an
>>>>> idea
>>>>> of the peak memory usage.
>>>>>
>>>>> The file system is lustre, which works well with the preloading
>>>>> of the
>>>>> data, and I ran it a couple times beforehand to make sure that the
>>>>> files were in local cache or whatever.
>>>>>
>>>>> So the communication was via shared memory, which while still an
>>>>> MPI
>>>>> interface is much closer to ideal. I will be giving it a go on a
>>>>> cluster tomorrow, after I work out some kinks with data
>>>>> storage. I've
>>>>> moved the generation of the binary hierarchies into yt -- so if
>>>>> you
>>>>> don't have one, rather than dumping the hierarchy into the .yt
>>>>> file,
>>>>> it will dump it into the .harrays file. This way if anyone else
>>>>> writes an interface for the binary hierarchy method, we can all
>>>>> share
>>>>> it. (I think it would be a bad idea to have Enzo output a .yt
>>>>> file.
>>>>> ;-) The .yt file will now exist solely to store objects, not
>>>>> any of
>>>>> the hierarchy info.
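>>>>> The fallback described above might look like this in outline (a
>>>>> sketch only -- the function names and call signatures here are
>>>>> hypothetical placeholders, not yt's actual API):

```python
import os

def load_hierarchy(basename, parse_text_hierarchy, read_harrays,
                   write_harrays):
    """Prefer the binary hierarchy if it exists; otherwise parse the
    text .hierarchy file and cache the arrays as .harrays for next
    time.  The helpers are passed in because their real forms live
    inside yt; these names are placeholders."""
    harrays = "%s.harrays" % basename
    if os.path.exists(harrays):
        # Fast path: one binary read, no text parsing.
        return read_harrays(harrays)
    arrays = parse_text_hierarchy("%s.hierarchy" % basename)
    write_harrays(harrays, arrays)  # cache for subsequent runs
    return arrays
```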
>>>>>
>>>>> -Matt
>>>>>
>>>>> On Mon, Nov 2, 2009 at 10:13 PM, Sam Skillman <72Nova at gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Hi Matt,
>>>>>> This is awesome. I don't think anyone can expect much faster
>>>>>> for that
>>>>>> dataset. I remember running projections just a year or so ago
>>>>>> on this
>>>>>> data
>>>>>> and it taking a whole lot more time (just reading in the data
>>>>>> took
>>>>>> ages).
>>>>>> What machine were you able to do this on? I'm mostly curious
>>>>>> about the
>>>>>> memory it used, or had available to it.
>>>>>> In any case, I'd say this is a pretty big success, and the binary
>>>>>> hierarchies are a great idea.
>>>>>> Cheers,
>>>>>> Sam
>>>>>> On Mon, Nov 2, 2009 at 8:47 PM, Matthew Turk <matthewturk at gmail.com> wrote:
>>>>>>>
>>>>>>> Hi guys,
>>>>>>>
>>>>>>> (For all of these performance indicators, I've used the 512^3 L7
>>>>>>> amr-everywhere run called the "LightCone." This particular
>>>>>>> dataset
>>>>>>> has ~380,000 grids and is a great place to find the )
>>>>>>>
>>>>>>> Last weekend I did a little bit of benchmarking and saw that the
>>>>>>> parallel projections (and likely several other parallel
>>>>>>> operations)
>>>>>>> all sat inside an MPI_Barrier for far too long. I converted (I
>>>>>>> think!) this process to be an MPI_Alltoallv operation, preceded by
>>>>>>> an MPI_Allreduce to get the final array size and the offsets into
>>>>>>> an ordered array, and I think it is working. I saw pretty good
>>>>>>> performance improvements, but it's tough to quantify those
>>>>>>> right now
>>>>>>> -- for projecting "Ones" (no disk-access) it sped things up by
>>>>>>> ~15%.
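>>>>>>> The bookkeeping at the heart of that rearrangement -- exchange
>>>>>>> per-task counts first, so every task can compute its displacement
>>>>>>> into one ordered receive buffer, then do a single Alltoallv --
>>>>>>> is just a prefix sum, sketched below without MPI (with mpi4py the
>>>>>>> final exchange would be comm.Alltoallv; this helper is
>>>>>>> illustrative, not yt's code):

```python
import numpy as np

def alltoallv_displacements(counts):
    """Given the number of items each task contributes, return each
    task's displacement (starting offset) into the single ordered
    receive array -- a prefix sum of the counts."""
    counts = np.asarray(counts, dtype=np.int64)
    displs = np.zeros_like(counts)
    displs[1:] = np.cumsum(counts)[:-1]
    return displs

# Items contributed by tasks 0..3; displacements are 0, 4, 4, 11.
counts = [4, 0, 7, 3]
print(alltoallv_displacements(counts))
```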
>>>>>>>
>>>>>>> I've also added a new binary hierarchy method to devel enzo,
>>>>>>> and it
>>>>>>> provides everything that is necessary for yt to analyze the
>>>>>>> data. As
>>>>>>> such, if a %(basename)s.harrays file exists, it will be used,
>>>>>>> and yt
>>>>>>> will not need to open the .hierarchy file at all. This sped
>>>>>>> things up
>>>>>>> by 100 seconds. I've written a script to create these
>>>>>>> (http://www.slac.stanford.edu/~mturk/create_harrays.py), but
>>>>>>> outputting them inline in Enzo is the fastest.
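>>>>>>> The idea behind the .harrays file can be sketched as follows;
>>>>>>> numpy's .npz container stands in here only so the example is
>>>>>>> self-contained (it is not the real .harrays layout), and the
>>>>>>> field names are assumptions:

```python
import os
import tempfile
import numpy as np

# Field names (left_edges, right_edges, ...) are illustrative; the
# actual .harrays layout is not reproduced here.

def write_harrays(path, left_edges, right_edges, dimensions, levels):
    # One binary file holding every array yt needs from the hierarchy.
    np.savez(path, left_edges=left_edges, right_edges=right_edges,
             dimensions=dimensions, levels=levels)

def read_harrays(path):
    # A single binary read replaces parsing the large text .hierarchy.
    with np.load(path) as d:
        return {k: d[k] for k in d.files}

# Tiny example: a root grid and one refined child.
le = np.array([[0.0, 0.0, 0.0], [0.25, 0.25, 0.25]])
re = np.array([[1.0, 1.0, 1.0], [0.50, 0.50, 0.50]])
dims = np.array([[32, 32, 32], [16, 16, 16]])
lvl = np.array([0, 1])

path = os.path.join(tempfile.mkdtemp(), "data0001.harrays.npz")
write_harrays(path, le, re, dims, lvl)
```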
>>>>>>>
>>>>>>> To top this all off, I ran a projection -- start to finish,
>>>>>>> including
>>>>>>> all overhead -- on 16 processors. To project the fields
>>>>>>> "Density"
>>>>>>> (native), "Temperature" (native) and
>>>>>>> "VelocityMagnitude" (derived,
>>>>>>> requires x-, y- and z-velocity) on 16 processors to the finest
>>>>>>> resolution (adaptive projection -- to L7) takes 140 seconds, or
>>>>>>> roughly 2:20.
>>>>>>>
>>>>>>> I've looked at the profiling outputs, and it seems to me that
>>>>>>> there
>>>>>>> are still some places performance could be squeezed out. That
>>>>>>> being
>>>>>>> said, I'm pretty pleased with these results.
>>>>>>>
>>>>>>> These are all in the named branch hierarchy-opt in mercurial.
>>>>>>> They
>>>>>>> rely on some rearrangement of the hierarchy parsing and
>>>>>>> whatnot that
>>>>>>> has lived in hg for a little while; it will go into the trunk
>>>>>>> as soon
>>>>>>> as I get the all clear about moving to a proper stable/less-
>>>>>>> stable dev
>>>>>>> environment. I also have some other test suites to run on
>>>>>>> them, and I
>>>>>>> want to make sure the memory usage is not excessive.
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Matt
>>>>>>> _______________________________________________
>>>>>>> Yt-dev mailing list
>>>>>>> Yt-dev at lists.spacepope.org
>>>>>>> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Samuel W. Skillman
>>>>>> DOE Computational Science Graduate Fellow
>>>>>> Center for Astrophysics and Space Astronomy
>>>>>> University of Colorado at Boulder
>>>>>> samuel.skillman[at]colorado.edu
>>>>>>
>>>>>>
>>>>>
>>>
>>>
>>