[yt-dev] Call for testing: Projection performance

Matthew Turk matthewturk at gmail.com
Wed Jun 27 10:43:21 PDT 2012


John tested the quadtree projection improvements in PR #156 (
https://bitbucket.org/yt_analysis/yt/pull-request/156/change-the-way-quad-tree-projections-are
) and found them to improve performance, reduce peak memory usage, and
flatten overall memory usage.  He accepted the PR, so parallel
projections should be substantially improved now.

-Matt

On Mon, May 7, 2012 at 9:18 AM, Matthew Turk <matthewturk at gmail.com> wrote:
> Hi everyone,
>
> Sam and I also had a bit of an off-list discussion about this, and we saw
> similar but not identical results to what Nathan has reported.
> (Specifically, Sam sees a lot less memory overhead than Nathan does.)
> Sorry for dropping this over the weekend and bringing it back up now;
> I had a lot going on over the last couple of days.  I think the key point
> in what Nathan writes is that the quadtree projection gets to be fast
> enough to be overwhelmed by overhead.  Also, it's not clear to me whether
> the numbers Nathan is reporting include hierarchy instantiation or not.
>
> This leads to a couple things --
>
>  * How can overhead be reduced?  This includes both the time to
> generate grid objects in memory and the memory of those grid objects
> and the affiliated hierarchy.  Parsing the file off disk is a
> one-time affair, so I am substantially less concerned about that.
>  * Where are the weak spots in the projection algorithm?  At some
> point, we need to do a global reduce.  The mechanism that I've
> identified here should require the least amount of memory allocation,
> but it could perhaps be improved in other ways.
>  * What is the low-hanging fruit for optimization inside the existing
> algorithm?
>
> There are a few operations that I already know have problems (2D
> Profiles is a big one); those are slated to be dealt with in the 2.5
> cycle.
>
> A while back, for the Blue Waters project, I wrote up a little bit of
> infrastructure for profiling the time spent in various routines.  I'm
> going to take time to resurrect this and add memory profiling as well.
> Following that, I'll run it regularly to inspect the current state of
> the code.  What I'd really like to ask for is help identifying routines
> that need to be inspected in detail, and figuring out how we can speed
> them up.  I'm not interested in ceding performance territory, but
> unfortunately this is an area we haven't really ever discussed openly
> on this list.
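>
> As a rough sketch of the kind of per-routine timing I have in mind --
> this just leans on the standard library's cProfile rather than the Blue
> Waters infrastructure itself, and the wrapped call is only a placeholder:
>
> import cProfile
> import pstats
>
> def profile_call(func, *args, **kwargs):
>     # Run one routine under the profiler and print the ten most
>     # expensive entries, sorted by cumulative time.
>     prof = cProfile.Profile()
>     result = prof.runcall(func, *args, **kwargs)
>     pstats.Stats(prof).sort_stats("cumulative").print_stats(10)
>     return result
>
> # Hypothetical usage around a projection (yt-2.x style call):
> # proj = profile_call(pf.h.proj, 0, "Density")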
>
> So:
>
> What routines are slow?
>
> -Matt
>
>
> On Fri, May 4, 2012 at 8:43 PM, Nathan Goldbaum <goldbaum at ucolick.org> wrote:
>> Hi all,
>>
>> I just did a scaling test on Pleiades at NASA Ames and got somewhat worse
>> scaling at high processor counts.  This is with one of Matt's datasets, so
>> that might be the issue.
>>
>> Here's some summary data for the
>> hierarchy: http://paste.yt-project.org/show/2348/
>>
>> I actually found superlinear scaling going from 1 processor to 2 processors,
>> so I made two different scaling plots.  I think the second plot (assuming
>> the 2-core run is representative of the true serial performance) is probably
>> more accurate.
>>
>> http://imgur.com/5pQ2P
>>
>> http://imgur.com/pOKty
>>
>> Since I was running these jobs interactively, I was able to get a pretty
>> good feel for which parts of the calculation were most time-consuming.  As
>> the plots above show, beyond 8 processors, the projection operation was so
>> fast that increasing the processor count really didn't help much.  Most of
>> the overhead is in parsing the hierarchy, setting up the MPI communicators,
>> and communicating and assembling the projection on all processors at the
>> end, in rough order of importance.
>>
>> This is also quite memory intensive: each core (independent of the global
>> number of cores) needed at least 4.5 gigabytes of memory.
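>>
>> If it helps with reproducing this, here's a minimal sketch of how each
>> MPI task could log its own peak memory at the stages above; it assumes
>> mpi4py is importable alongside yt, which may not match the exact setup
>> I used:
>>
>> import resource
>> from mpi4py import MPI
>>
>> def log_peak_memory(stage):
>>     # Each rank reports its own peak RSS (ru_maxrss is in kB on Linux).
>>     peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
>>     rank = MPI.COMM_WORLD.Get_rank()
>>     print("rank %d, %s: peak RSS %.2f GB" % (rank, stage, peak_kb / 1024.0 ** 2))
>>
>> # Hypothetical call sites, one per stage listed above:
>> # log_peak_memory("after hierarchy parse")
>> # log_peak_memory("after projection merge")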
>>
>> Nathan Goldbaum
>> Graduate Student
>> Astronomy & Astrophysics, UCSC
>> goldbaum at ucolick.org
>> http://www.ucolick.org/~goldbaum
>>
>> On May 4, 2012, at 4:20 AM, Matthew Turk wrote:
>>
>> Hi Sam,
>>
>> Thanks a ton.  This looks good to me, seeing as at low task counts we
>> have the overhead of creating the tree, and at high task counts we'll
>> have the collective operations.  I'll try to get ahold of another
>> testing machine and then I'll issue a PR.  (And close Issue #348!)
>>
>> -Matt
>>
>> On Thu, May 3, 2012 at 6:47 PM, Sam Skillman <samskillman at gmail.com> wrote:
>>
>> Meant to include the scaling image.
>>
>> On Thu, May 3, 2012 at 4:44 PM, Sam Skillman <samskillman at gmail.com> wrote:
>>
>> Hi Matt & friends,
>>
>> I tested this on a fairly large nested simulation with about 60k grids
>> using 6 nodes of Janus (dual-hex nodes) and ran from 1 to 64 processors.
>> I got fairly good scaling and made a quick mercurial repo on bitbucket
>> with everything except the dataset needed to do a similar study:
>> https://bitbucket.org/samskillman/quad-tree-proj-performance
>>
>> Raw timing (cores, wall-clock time):
>>
>> projects/quad_proj_scale:more perf.dat
>> 64 2.444e+01
>> 32 4.834e+01
>> 16 7.364e+01
>>  8 1.125e+02
>>  4 1.853e+02
>>  2 3.198e+02
>>  1 6.370e+02
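>>
>> As a quick sanity check on those numbers, a few lines of Python turn
>> perf.dat into speedup and efficiency figures (this assumes the file is
>> exactly the two columns shown above):
>>
>> # Read (cores, time) pairs from perf.dat and report scaling.
>> rows = [line.split() for line in open("perf.dat") if line.strip()]
>> timings = sorted((int(n), float(t)) for n, t in rows)
>> t_serial = timings[0][1]
>> for n, t in timings:
>>     speedup = t_serial / t
>>     print("%3d cores: time %10.3e  speedup %5.2f  efficiency %5.1f%%"
>>           % (n, t, speedup, 100.0 * speedup / n))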
>>
>> A few notes:
>>
>> -- I ran with 64 cores first, then again so that the disks were somewhat
>> warmed up, and only used the second timing of the 64-core run.
>> -- While I did get full nodes, the machine doesn't have a ton of I/O
>> nodes, so in an ideal setting performance may be even better.
>> -- My guess would be that a lot of this speedup comes from having a
>> parallel filesystem, so you may not get speedups as great on your laptop.
>> -- Speedup from 32 to 64 is nearly ideal... this is great.
>>
>> This looks pretty great to me, and I'd +1 any PR.
>>
>> Sam
>>
>> On Thu, May 3, 2012 at 1:42 PM, Matthew Turk <matthewturk at gmail.com> wrote:
>>
>> Hi all,
>>
>> I implemented this "quadtree extension" that duplicates the quadtree
>> on all processors, which may make it easier to scale projections.
>> Previously the procedure was:
>>
>> 1) Locally project
>> 2) Merge across procs:
>>  2a) Serialize quadtree
>>  2b) Point-to-point communicate
>>  2c) Deserialize
>>  2d) Merge local and remote
>>  2e) Repeat up to 2a
>> 3) Finish
>>
>> I've added a step 0), which is "initialize entire quadtree"; this
>> means all of step 2 becomes "perform a sum of a big array on all
>> procs."  This has good and bad elements: we're still doing a lot of
>> heavy communication across processors, but it will be managed by the
>> MPI implementation instead of by yt.  Also, we avoid all of the costly
>> serialize/deserialize procedures.  So for a given dataset, step 0 will
>> be fixed in cost, but step 1 will be reduced as the number of
>> processors goes up.  Step 2, which is now a single communication step
>> (or two), will increase in cost as the number of processors increases.
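>>
>> To make the shape of the change concrete, here's a rough sketch of the
>> new merge step using mpi4py directly -- the array size, the slicing,
>> and the names are placeholders, not the actual quadtree code:
>>
>> import numpy as np
>> from mpi4py import MPI
>>
>> comm = MPI.COMM_WORLD
>> rank, size = comm.Get_rank(), comm.Get_size()
>>
>> # Step 0: every rank allocates the full (flattened) quadtree buffer, so
>> # all ranks agree on its size and cell ordering.
>> n_cells = 1024 * 1024                    # placeholder for the tree size
>> local_proj = np.zeros(n_cells, dtype="float64")
>>
>> # Step 1: each rank deposits only its own grids into the shared layout.
>> chunk = n_cells // size
>> lo, hi = rank * chunk, (rank + 1) * chunk
>> local_proj[lo:hi] = 1.0                  # stand-in for local projection
>>
>> # Step 2: the serialize / point-to-point / merge loop collapses into a
>> # single collective sum handled by the MPI implementation.
>> full_proj = np.empty_like(local_proj)
>> comm.Allreduce(local_proj, full_proj, op=MPI.SUM)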
>>
>> So, it's not clear whether this will *actually* be helpful.  It
>> needs testing, and I've pushed it here:
>>
>> bb://MatthewTurk/yt/
>> hash 3f39eb7bf468
>>
>> If anybody out there could test it, I'd be mighty glad.  This is the
>> script I've been using:
>>
>> http://paste.yt-project.org/show/2343/
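>>
>> If you'd rather not grab the paste, a minimal script along these lines
>> should exercise the same path (the dataset path is a placeholder, and
>> this is a sketch rather than the exact paste contents):
>>
>> import time
>> from yt.mods import *
>>
>> pf = load("DD0039/output_0039")     # placeholder dataset
>>
>> t0 = time.time()
>> proj = pf.h.proj(0, "Density")      # quadtree projection along x
>> t1 = time.time()
>> print("projection took %.3e s" % (t1 - t0))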
>>
>> I'd *greatly* appreciate testing results -- particularly for proc
>> combos like 1, 2, 4, 8, 16, 32, 64, ... .  On my machine, the results
>> are somewhat inconclusive.  Keep in mind you'll have to run with the
>> option:
>>
>> --config serialize=False
>>
>> to get real results.  Here's the shell command I used:
>>
>> ( for i in 1 2 3 4 5 6 7 8 9 10 ; do mpirun -np ${i} python2.7 proj.py --parallel --config serialize=False ; done ) 2>&1 | tee proj_new.log
>>
>> Comparison against results from the old method would also be super
>> helpful.
>>
>> The alternate idea I'd had was a bit different, harder to implement,
>> and has a glaring problem.  The idea would be to serialize arrays and
>> do the butterfly reduction, but instead of converting into data
>> objects, simply walk Hilbert indices progressively.  Unfortunately this
>> only works up to an effective size of 2^32, which is not going to work
>> in a lot of cases.
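>>
>> Roughly, assuming the Hilbert keys are packed into a single 64-bit
>> integer, the bit budget works out like this (a 2D Hilbert index over an
>> N x N effective mesh needs 2*log2(N) bits):
>>
>> # Bit budget for a 2D Hilbert key packed into one 64-bit integer
>> # (this reading of the 2^32 limit is an assumption, not from the code).
>> key_bits = 64
>> bits_per_level = 2                    # quadtree: 2 bits per refinement level
>> max_levels = key_bits // bits_per_level
>> print("max levels: %d" % max_levels)                      # 32
>> print("max effective size per axis: 2^%d" % max_levels)   # 2^32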
>>
>> Anyway, if this doesn't work, I'd be eager to hear if anybody has any
>> ideas.  :)
>>
>> -Matt
>>
>> _______________________________________________
>> yt-dev mailing list
>> yt-dev at lists.spacepope.org
>> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
>>


