[yt-dev] Call for testing: Projection performance

Nathan Goldbaum goldbaum at ucolick.org
Fri May 4 17:43:11 PDT 2012


Hi all,

I just did a scaling test on Pleiades at NASA Ames and got somewhat worse scaling at high processor counts.  This is with one of Matt's datasets, so that might be the issue.

Here's some summary data for the hierarchy: http://paste.yt-project.org/show/2348/

I actually found superlinear scaling going from 1 processor to 2 processors, so I made two different scaling plots.  I think the second plot (which assumes the 2-core run is representative of the true serial performance) is probably the more accurate one.

http://imgur.com/5pQ2P

http://imgur.com/pOKty

Since I was running these jobs interactively, I was able to get a pretty good feel for which parts of the calculation were most time-consuming.  As the plots above show, beyond 8 processors, the projection operation was so fast that increasing the processor count really didn't help much.  Most of the overhead is in parsing the hierarchy, setting up the MPI communicators, and communicating and assembling the projection on all processors at the end, in rough order of importance.

This is also quite memory-intensive: each core (independent of the total number of cores) needed at least 4.5 gigabytes of memory.
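
Roughly, here's how I split out the timings, in case anyone wants to
reproduce this (a minimal sketch; the dataset name is a placeholder,
and it runs under the same mpirun/--parallel invocation as Matt's
script):

import time
from yt.mods import *

t0 = time.time()
pf = load("RD0035/RedshiftOutput0035")  # placeholder dataset
pf.h  # forces the hierarchy parse
t1 = time.time()
proj = pf.h.proj(0, "Density")  # the projection itself
t2 = time.time()
print "hierarchy: %.1f s, projection: %.1f s" % (t1 - t0, t2 - t1)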

Nathan Goldbaum
Graduate Student
Astronomy & Astrophysics, UCSC
goldbaum at ucolick.org
http://www.ucolick.org/~goldbaum

On May 4, 2012, at 4:20 AM, Matthew Turk wrote:

> Hi Sam,
> 
> Thanks a ton.  This looks good to me, seeing as with few tasks we
> have the overhead of creating the tree, and with many tasks we'll
> have the cost of collective operations.  I'll try to get hold of
> another testing machine and then I'll issue a PR.  (And close Issue
> #348!)
> 
> -Matt
> 
> On Thu, May 3, 2012 at 6:47 PM, Sam Skillman <samskillman at gmail.com> wrote:
>> Meant to include the scaling image.
>> 
>> 
>> On Thu, May 3, 2012 at 4:44 PM, Sam Skillman <samskillman at gmail.com> wrote:
>>> 
>>> Hi Matt & friends,
>>> 
>>> I tested this on a fairly large nested simulation with about 60k grids
>>> using 6 nodes of Janus (dual-hex nodes) and ran from 1 to 64 processors.  I
>>> got fairly good scaling and made a quick mercurial repo on bitbucket with
>>> everything except the dataset needed to do a similar
>>> study. https://bitbucket.org/samskillman/quad-tree-proj-performance
>>> 
>>> Raw timing from perf.dat (cores, wall-clock seconds):
>>> 64 2.444e+01
>>> 32 4.834e+01
>>> 16 7.364e+01
>>> 8 1.125e+02
>>> 4 1.853e+02
>>> 2 3.198e+02
>>> 1 6.370e+02
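>>>
>>> (Relative to the 1-core run, that's 637.0/112.5 = ~5.7x at 8 cores,
>>> or ~71% efficiency, and 637.0/24.44 = ~26.1x at 64 cores, or ~41%
>>> efficiency.)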
>>> 
>>> A few notes:
>>> -- I ran the 64-core case twice so that the disks were somewhat
>>> warmed up, and used only the timing from the second run.
>>> -- While I did get full nodes, the machine doesn't have a ton of I/O nodes
>>> so in an ideal setting performance may be even better.
>>> -- My guess would be that a lot of this speedup comes from having a
>>> parallel filesystem, so you may not see speedups this good on a laptop.
>>> -- Speedup from 32 to 64 is nearly ideal...this is great.
>>> 
>>> This looks pretty great to me, and I'd +1 any PR.
>>> 
>>> Sam
>>> 
>>> On Thu, May 3, 2012 at 1:42 PM, Matthew Turk <matthewturk at gmail.com>
>>> wrote:
>>>> 
>>>> Hi all,
>>>> 
>>>> I implemented this "quadtree extension" that duplicates the quadtree
>>>> on all processors, which may help projections scale better.
>>>> Previously the procedure was:
>>>> 
>>>> 1) Locally project
>>>> 2) Merge across procs:
>>>>  2a) Serialize quadtree
>>>>  2b) Communicate point-to-point
>>>>  2c) Deserialize
>>>>  2d) Merge local and remote
>>>>  2e) Repeat from 2a
>>>> 3) Finish
>>>> 
>>>> I've added a step 0) which is "initialize entire quadtree", which
>>>> means all of step 2 becomes "perform sum of big array on all procs."
>>>> This has good and bad elements: we're still doing a lot of heavy
>>>> communication across processors, but it will be managed by the MPI
>>>> implementation instead of by yt.  Also, we avoid all of the costly
>>>> serialize/deserialize procedures.  So for a given dataset, step 0 will
>>>> be fixed in cost, but step 1 will be reduced as the number of
>>>> processors goes up.  Step 2, which is now just one or two
>>>> communication steps, will increase in cost as the number of
>>>> processors grows.
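>>>>
>>>> To make that concrete, the merge now amounts to something like the
>>>> following (an mpi4py sketch, not the actual yt code; the array size
>>>> and names are made up):
>>>>
>>>> import numpy as np
>>>> from mpi4py import MPI
>>>>
>>>> comm = MPI.COMM_WORLD
>>>> # Step 0: every rank allocates the full tree up front, so the
>>>> # buffers have identical shape everywhere.
>>>> local_sums = np.zeros(1048576, dtype="float64")
>>>> local_sums[comm.rank::comm.size] = 1.0  # stand-in for local projection
>>>> # All of step 2 collapses into a single collective sum handled by
>>>> # the MPI implementation:
>>>> global_sums = np.empty_like(local_sums)
>>>> comm.Allreduce(local_sums, global_sums, op=MPI.SUM)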
>>>> 
>>>> So, it's not clear whether this will *actually* be helpful.  It
>>>> needs testing, and I've pushed it here:
>>>> 
>>>> bb://MatthewTurk/yt/
>>>> hash 3f39eb7bf468
>>>> 
>>>> If anybody out there could test it, I'd be mighty glad.  This is the
>>>> script I've been using:
>>>> 
>>>> http://paste.yt-project.org/show/2343/
>>>> 
>>>> I'd *greatly* appreciate testing results -- particularly for proc
>>>> combos like 1, 2, 4, 8, 16, 32, 64, ... .  On my machine, the results
>>>> are somewhat inconclusive.  Keep in mind you'll have to run with the
>>>> option:
>>>> 
>>>> --config serialize=False
>>>> 
>>>> to get real results.  Here's the shell command I used:
>>>> 
>>>> ( for i in 1 2 3 4 5 6 7 8 9 10 ; do mpirun -np ${i} python2.7 proj.py
>>>> --parallel --config serialize=False ; done ) 2>&1 | tee proj_new.log
>>>> 
>>>> Comparison against results from the old method would also be super
>>>> helpful.
>>>> 
>>>> The alternate idea that I'd had was a bit different, but harder to
>>>> implement, and also with a glaring problem.  The idea would be to
>>>> serialize arrays, do the butterfly reduction, but instead of
>>>> converting into data objects simply progressively walk hilbert
>>>> indices.  Unfortunately this only works for up to 2^32 effective size,
>>>> which is not going to work in a lot of cases.
>>>> 
>>>> Anyway, if this doesn't work, I'd be eager to hear if anybody has any
>>>> ideas.  :)
>>>> 
>>>> -Matt
>>>> _______________________________________________
>>>> yt-dev mailing list
>>>> yt-dev at lists.spacepope.org
>>>> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
>>> 
>>> 
>> 
>> 
