[yt-dev] Call for testing: Projection performance
Richard P Wagner
rpwagner at sdsc.edu
Sat May 5 19:13:40 PDT 2012
Hi Matt,
I wanted to let you know that I'll try this out as part of the L7 work. I've just been blocked from SSH for a few days.
--Rick
On May 4, 2012, at 4:28 AM, "Matthew Turk" <matthewturk at gmail.com> wrote:
> Hi Sam,
>
> Thanks a ton. This looks good to me, since at few tasks we have the
> overhead of creating the tree, and at many tasks we'll have the
> collective operations. I'll try to get ahold of another testing
> machine and then I'll issue a PR. (And close Issue #348!)
>
> -Matt
>
> On Thu, May 3, 2012 at 6:47 PM, Sam Skillman <samskillman at gmail.com> wrote:
>> Meant to include the scaling image.
>>
>>
>> On Thu, May 3, 2012 at 4:44 PM, Sam Skillman <samskillman at gmail.com> wrote:
>>>
>>> Hi Matt & friends,
>>>
>>> I tested this on a fairly large nested simulation with about 60k grids
>>> using 6 nodes of Janus (dual-hex nodes) and ran from 1 to 64 processors. I
>>> got fairly good scaling and made a quick Mercurial repo on Bitbucket with
>>> everything except the dataset needed to do a similar
>>> study. https://bitbucket.org/samskillman/quad-tree-proj-performance
>>>
>>> Raw timing (cores vs. wall-clock seconds), from perf.dat:
>>> 64 2.444e+01
>>> 32 4.834e+01
>>> 16 7.364e+01
>>> 8 1.125e+02
>>> 4 1.853e+02
>>> 2 3.198e+02
>>> 1 6.370e+02
>>>
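>>> For reference, a quick sketch (assuming perf.dat holds the two columns
>>> above) to turn these numbers into speedup and parallel efficiency:
>>>
>>> import numpy as np
>>>
>>> procs, times = np.loadtxt("perf.dat", unpack=True)
>>> t1 = times[procs == 1][0]
>>> for p, t in zip(procs, times):
>>>     # efficiency = speedup / cores
>>>     print("%3d cores: %8.2f s, speedup %5.1fx, efficiency %3.0f%%"
>>>           % (p, t, t1 / t, 100.0 * t1 / (p * t)))
>>>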
>>> A few notes:
>>> -- I ran the 64-core case first, then ran it again so that the disks were
>>> somewhat warmed up, and used only the second timing of the 64-core run.
>>> -- While I did get full nodes, the machine doesn't have many I/O nodes,
>>> so in an ideal setting performance might be even better.
>>> -- My guess is that much of this speedup comes from having a parallel
>>> filesystem, so you may not see speedups this good on a laptop.
>>> -- Speedup from 32 to 64 is nearly ideal...this is great.
>>>
>>> This looks pretty great to me, and I'd +1 any PR.
>>>
>>> Sam
>>>
>>> On Thu, May 3, 2012 at 1:42 PM, Matthew Turk <matthewturk at gmail.com>
>>> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I implemented this "quadtree extension" that duplicates the quadtree
>>>> on all processors, which may help projections scale more nicely.
>>>> Previously the procedure was:
>>>>
>>>> 1) Locally project
>>>> 2) Merge across procs:
>>>> 2a) Serialize quadtree
>>>> 2b) Point-to-point communicate
>>>> 2c) Deserialize
>>>> 2d) Merge local and remote
>>>> 2e) Repeat from 2a
>>>> 3) Finish
>>>>
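>>>> In code, that merge is roughly a binary-tree reduction; here's a minimal
>>>> sketch assuming mpi4py, where merge_local_remote is a hypothetical
>>>> stand-in for yt's quadtree merge (step 2d):
>>>>
>>>> from mpi4py import MPI
>>>>
>>>> def reduce_quadtree(comm, tree, merge_local_remote):
>>>>     # lowercase send/recv pickle the object -- that's the
>>>>     # serialize/deserialize cost of steps 2a-2c
>>>>     rank, size = comm.Get_rank(), comm.Get_size()
>>>>     step = 1
>>>>     while step < size:  # 2e) repeat until one rank holds everything
>>>>         if rank % (2 * step) == 0:
>>>>             if rank + step < size:
>>>>                 remote = comm.recv(source=rank + step)   # 2b) + 2c)
>>>>                 tree = merge_local_remote(tree, remote)  # 2d)
>>>>         else:
>>>>             comm.send(tree, dest=rank - step)            # 2a) + 2b)
>>>>             return None  # this rank's data has moved upstream
>>>>         step *= 2
>>>>     return tree  # rank 0 ends up with the merged quadtree
>>>>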
>>>> I've added a step 0) which is "initialize entire quadtree", which
>>>> means all of step 2 becomes "perform sum of big array on all procs."
>>>> This has good and bad elements: we're still doing a lot of heavy
>>>> communication across processors, but it will be managed by the MPI
>>>> implementation instead of by yt. Also, we avoid all of the costly
>>>> serialize/deserialize procedures. So for a given dataset, step 0 will
>>>> be fixed in cost, but step 1 will be reduced as the number of
>>>> processors goes up. Step 2, which now is a single (or two)
>>>> communication steps, will increase in cost with increasing number of
>>>> processors.
>>>>
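>>>> Conceptually, step 2 now collapses to a single collective; a minimal
>>>> sketch assuming mpi4py, where tree_values is a hypothetical flattened
>>>> array of quadtree values (not yt's actual data structure):
>>>>
>>>> from mpi4py import MPI
>>>> import numpy as np
>>>>
>>>> comm = MPI.COMM_WORLD
>>>> n_nodes = 1024  # hypothetical size of the fully-initialized tree
>>>>
>>>> # step 0: every rank allocates the same full tree (fixed cost)
>>>> tree_values = np.zeros(n_nodes, dtype="float64")
>>>>
>>>> # step 1: deposit this rank's grids into tree_values
>>>> # (this cost shrinks as the number of processors grows)
>>>>
>>>> # step 2: one in-place sum replaces serialize/send/deserialize/merge
>>>> comm.Allreduce(MPI.IN_PLACE, tree_values, op=MPI.SUM)
>>>>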
>>>> So, it's not clear whether this will *actually* be helpful. It
>>>> needs testing, and I've pushed it here:
>>>>
>>>> bb://MatthewTurk/yt/
>>>> hash 3f39eb7bf468
>>>>
>>>> If anybody out there could test it, I'd be mighty glad. This is the
>>>> script I've been using:
>>>>
>>>> http://paste.yt-project.org/show/2343/
>>>>
>>>> I'd *greatly* appreciate testing results -- particularly for proc
>>>> combos like 1, 2, 4, 8, 16, 32, 64, ... . On my machine, the results
>>>> are somewhat inconclusive. Keep in mind you'll have to run with the
>>>> option:
>>>>
>>>> --config serialize=False
>>>>
>>>> to get real results. Here's the shell command I used:
>>>>
>>>> ( for i in 1 2 3 4 5 6 7 8 9 10 ; do mpirun -np ${i} python2.7 proj.py
>>>> --parallel --config serialize=False ; done ) 2>&1 | tee proj_new.log
>>>>
>>>> Comparison against results from the old method would also be super
>>>> helpful.
>>>>
>>>> The alternate idea I'd had was a bit different, harder to
>>>> implement, and has a glaring problem. It would be to
>>>> serialize arrays and do the butterfly reduction, but instead of
>>>> converting into data objects, simply progressively walk Hilbert
>>>> indices. Unfortunately this only works up to a 2^32 effective size,
>>>> which is not enough in a lot of cases.
>>>>
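>>>> For completeness, the butterfly (recursive-doubling) pattern mentioned
>>>> above looks roughly like this -- a sketch assuming mpi4py and a
>>>> power-of-two number of ranks, with a plain sum standing in for the
>>>> Hilbert-index walk:
>>>>
>>>> import numpy as np
>>>> from mpi4py import MPI
>>>>
>>>> def butterfly_sum(comm, local):
>>>>     rank, size = comm.Get_rank(), comm.Get_size()
>>>>     data = local.copy()
>>>>     mask = 1
>>>>     while mask < size:
>>>>         partner = rank ^ mask  # pairwise exchange at each round
>>>>         recv = np.empty_like(data)
>>>>         comm.Sendrecv(data, dest=partner, recvbuf=recv, source=partner)
>>>>         data += recv  # the real version would merge by Hilbert index
>>>>         mask <<= 1
>>>>     return data  # every rank ends with the global result
>>>>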
>>>> Anyway, if this doesn't work, I'd be eager to hear if anybody has any
>>>> ideas. :)
>>>>
>>>> -Matt
>>>
>>>
>>
>>
> _______________________________________________
> yt-dev mailing list
> yt-dev at lists.spacepope.org
> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org