[yt-dev] Call for testing: Projection performance
Sam Skillman
samskillman at gmail.com
Thu May 3 15:44:29 PDT 2012
Hi Matt & friends,
I tested this on a fairly large nested simulation with about 60k grids
using 6 nodes of Janus (dual-hex nodes) and ran from 1 to 64 processors. I
got fairly good scaling and made a quick mercurial repo on bitbucket with
everything except the dataset needed to do a similar study.
https://bitbucket.org/samskillman/quad-tree-proj-performance
Raw timing (number of processors, then run time):
projects/quad_proj_scale:more perf.dat
64 2.444e+01
32 4.834e+01
16 7.364e+01
8 1.125e+02
4 1.853e+02
2 3.198e+02
1 6.370e+02
A few notes:
-- I ran the 64-core case twice so that the disks were somewhat warmed up,
and used only the second timing.
-- While I did get full nodes, the machine doesn't have a ton of I/O nodes
so in an ideal setting performance may be even better.
-- My guess would be that a lot of this speedup comes from having a
parallel filesystem, so you may not see speedups this large on your laptop.
-- Speedup from 32 to 64 is nearly ideal (about 1.98x)...this is great.
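
For anyone who wants to turn the raw timings into speedup and parallel efficiency, here is a minimal sketch. It assumes the two columns of perf.dat are processor count and run time, and that efficiency is measured relative to the single-processor run; the function and variable names are only illustrative.

```python
# Timings copied from perf.dat above: processor count -> run time.
timings = {
    64: 2.444e+01,
    32: 4.834e+01,
    16: 7.364e+01,
    8: 1.125e+02,
    4: 1.853e+02,
    2: 3.198e+02,
    1: 6.370e+02,
}


def scaling_table(timings):
    """Return (nprocs, speedup, efficiency) rows relative to the smallest run."""
    base_n = min(timings)
    base_t = timings[base_n]
    rows = []
    for n in sorted(timings):
        speedup = base_t / timings[n]
        efficiency = speedup / (n / base_n)  # 1.0 would be ideal scaling
        rows.append((n, speedup, efficiency))
    return rows


rows = scaling_table(timings)
for n, speedup, eff in rows:
    print(f"{n:3d} procs: speedup {speedup:6.2f}, efficiency {eff:5.2f}")

# Ratio of the 32-processor time to the 64-processor time (ideal would be 2).
step_32_to_64 = timings[32] / timings[64]
print(f"32 -> 64 procs: {step_32_to_64:.2f}x")
```

By this measure the overall speedup from 1 to 64 processors is about 26x (roughly 41% efficiency), while the last doubling comes in at just under 2x.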
This looks pretty great to me, and I'd +1 any PR.
Sam
On Thu, May 3, 2012 at 1:42 PM, Matthew Turk <matthewturk at gmail.com> wrote:
> Hi all,
>
> I implemented this "quadtree extension" that duplicates the quadtree
> on all processors, which may make it nicer to scale projections.
> Previously the procedure was:
>
> 1) Locally project
> 2) Merge across procs:
> 2a) Serialize quadtree
> 2b) Point-to-point communicate
> 2c) Deserialize
> 2d) Merge local and remote
> 2e) Repeat from 2a
> 3) Finish
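
As a rough illustration of the merge loop in step 2, here is a single-process sketch. It is not yt's code: a plain dict stands in for the quadtree, `pickle` stands in for whatever serialization yt uses, the pairing of ranks in each round is an assumption, and the function names are hypothetical.

```python
import pickle


def merge_into(local, remote):
    """Merge step: add remote cell values into the local tree (a dict here)."""
    for cell, value in remote.items():
        local[cell] = local.get(cell, 0.0) + value


def pairwise_reduce(trees):
    """Merge per-rank trees in rounds; the full result ends up on rank 0."""
    trees = [dict(t) for t in trees]  # copy so the inputs are left untouched
    stride = 1
    while stride < len(trees):
        for receiver in range(0, len(trees), 2 * stride):
            sender = receiver + stride
            if sender < len(trees):
                payload = pickle.dumps(trees[sender])  # serialize
                # In a real run the payload would travel point-to-point
                # over MPI; here it just stays in memory.
                remote = pickle.loads(payload)         # deserialize
                merge_into(trees[receiver], remote)    # merge local and remote
        stride *= 2                                    # next round
    return trees[0]


# Three pretend ranks, each holding only the cells it projected locally.
local_trees = [
    {(0, 0): 1.0, (0, 1): 2.0},
    {(0, 1): 0.5, (1, 1): 4.0},
    {(1, 0): 3.0},
]
merged = pairwise_reduce(local_trees)
print(merged)
```

Every round pays the serialize/deserialize cost again, which is the overhead the procedure above incurs on each repeat.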
>
> I've added a step 0) which is "initialize entire quadtree", which
> means all of step 2 becomes "perform sum of big array on all procs."
> This has good and bad elements: we're still doing a lot of heavy
> communication across processors, but it will be managed by the MPI
> implementation instead of by yt. Also, we avoid all of the costly
> serialize/deserialize procedures. So for a given dataset, step 0 will
> be fixed in cost, but step 1 will be reduced as the number of
> processors goes up. Step 2, which is now one or two
> communication steps, will increase in cost with increasing number of
> processors.
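
A toy version of the new scheme may make the trade-off easier to see. This is only a sketch under stated assumptions: a flat list stands in for the fully initialized quadtree, the "ranks" are simulated in one process, and `project_locally` and `allreduce_sum` are hypothetical names. In an actual MPI run the sum would presumably be a single collective such as mpi4py's `comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)`.

```python
def project_locally(n_cells, owned_cells):
    """Steps 0 and 1: allocate the full tree, then fill this rank's cells."""
    full = [0.0] * n_cells  # step 0: identical layout on every rank
    for cell, value in owned_cells.items():
        full[cell] += value  # step 1: local projection
    return full


def allreduce_sum(per_rank_arrays):
    """Step 2: element-wise sum, standing in for an MPI all-reduce."""
    return [sum(values) for values in zip(*per_rank_arrays)]


n_cells = 4
per_rank = [
    project_locally(n_cells, {0: 1.0, 1: 2.0}),
    project_locally(n_cells, {1: 0.5, 3: 4.0}),
    project_locally(n_cells, {2: 3.0}),
]
total = allreduce_sum(per_rank)
print(total)
```

Because every rank holds an array of the same shape, no serialization or tree merging is needed, but each rank pays for the full-size array even if it owns only a few cells.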
>
> So, it's not clear whether this will *actually* be helpful. It
> needs testing, and I've pushed it here:
>
> bb://MatthewTurk/yt/
> hash 3f39eb7bf468
>
> If anybody out there could test it, I'd be mighty glad. This is the
> script I've been using:
>
> http://paste.yt-project.org/show/2343/
>
> I'd *greatly* appreciate testing results -- particularly for proc
> combos like 1, 2, 4, 8, 16, 32, 64, ... . On my machine, the results
> are somewhat inconclusive. Keep in mind you'll have to run with the
> option:
>
> --config serialize=False
>
> to get real results. Here's the shell command I used:
>
> ( for i in 1 2 3 4 5 6 7 8 9 10 ; do mpirun -np ${i} python2.7 proj.py
> --parallel --config serialize=False ; done ) 2>&1 | tee proj_new.log
>
> Comparison against results from the old method would also be super helpful.
>
> The alternate idea that I'd had was a bit different, but harder to
> implement, and also with a glaring problem. The idea would be to
> serialize arrays, do the butterfly reduction, but instead of
> converting into data objects simply progressively walk Hilbert
> indices. Unfortunately this only works for up to 2^32 effective size,
> which is not going to work in a lot of cases.
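
For reference, a small sketch of what walking Hilbert indices looks like in 2D. This is the standard textbook mapping from cell coordinates to a position along the curve, not anything taken from yt, and `hilbert_index` is a hypothetical name. The last lines show one possible reading of the 2^32 limit, which is an assumption on my part: with 32 bits per axis, the combined index exactly fills a 64-bit integer.

```python
def hilbert_index(order, x, y):
    """Map cell (x, y) on a 2**order x 2**order grid to its Hilbert index."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


order = 2
curve = sorted(
    (hilbert_index(order, x, y), (x, y))
    for x in range(1 << order)
    for y in range(1 << order)
)
print([cell for _, cell in curve])

# Assumed origin of the limit: a 2**32 x 2**32 grid has 4**32 cells, so the
# largest index needs exactly 64 bits.
max_index_bits = (4 ** 32 - 1).bit_length()
print(max_index_bits)
```

Sorting cells by this index gives an ordering in which consecutive cells are always neighbors, which is what would let two sorted arrays be merged by walking them in step.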
>
> Anyway, if this doesn't work, I'd be eager to hear if anybody has any
> ideas. :)
>
> -Matt
> _______________________________________________
> yt-dev mailing list
> yt-dev at lists.spacepope.org
> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
>