Hi Matt & friends,<div><br></div><div>I tested this on a fairly large nested simulation with about 60k grids using 6 nodes of Janus (dual-hex nodes) and ran from 1 to 64 processors.  I got fairly good scaling and made a quick mercurial repo on bitbucket with everything except the dataset needed to do a similar study. <a href="https://bitbucket.org/samskillman/quad-tree-proj-performance">https://bitbucket.org/samskillman/quad-tree-proj-performance</a></div>


<div><br></div><div>Raw timing:</div><div><div>projects/quad_proj_scale:more perf.dat </div><div>64 2.444e+01</div><div>32 4.834e+01</div><div>16 7.364e+01</div><div>8 1.125e+02</div><div>4 1.853e+02</div><div>2 3.198e+02</div>


<div>1 6.370e+02</div></div><div><br></div><div>A few notes: </div><div>-- I ran with 64 cores first, then again so that the disks were somewhat warmed up, then only used the second timing of the 64 core run.</div><div>-- While I did get full nodes, the machine doesn't have a ton of I/O nodes so in an ideal setting performance may be even better.</div>


<div>-- My guess would be that a lot of this speedup comes from having a parallel filesystem, so you may not see speedups this good on your laptop.</div><div>-- Speedup from 32 to 64 is nearly ideal (4.834e+01 s down to 2.444e+01 s, about 1.98x), which is great.</div>
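<div><br></div><div>For anyone repeating the study, here is a minimal sketch that turns the raw timings above into speedup and parallel efficiency. The timings are copied from perf.dat; the function name scaling_table and the column layout are only illustrative choices, not part of the linked repo.</div>

```python
# Timings copied from perf.dat above: processor count -> wall-clock seconds.
timings = {
    1: 6.370e+02,
    2: 3.198e+02,
    4: 1.853e+02,
    8: 1.125e+02,
    16: 7.364e+01,
    32: 4.834e+01,
    64: 2.444e+01,
}


def scaling_table(timings):
    """Return (nprocs, seconds, speedup, efficiency, step_ratio) rows.

    speedup    = T(1) / T(N)
    efficiency = speedup / N
    step_ratio = T(previous N) / T(N); 2.0 is ideal when N doubles.
    """
    base = timings[1]
    rows = []
    previous = None
    for nprocs in sorted(timings):
        seconds = timings[nprocs]
        speedup = base / seconds
        efficiency = speedup / nprocs
        step_ratio = None if previous is None else previous / seconds
        rows.append((nprocs, seconds, speedup, efficiency, step_ratio))
        previous = seconds
    return rows


rows = scaling_table(timings)
print("procs   seconds  speedup  effic.   step")
for nprocs, seconds, speedup, efficiency, step_ratio in rows:
    step = "-" if step_ratio is None else f"{step_ratio:.2f}"
    print(f"{nprocs:5d} {seconds:9.2f} {speedup:8.2f} {efficiency:7.2f} {step.rjust(6)}")
```

<div>With these numbers the overall speedup from 1 to 64 processors comes out at roughly 26x (about 41% parallel efficiency), and the 32 to 64 step is about 1.98x.</div>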


<div><br></div><div>This looks pretty great to me, and I'd +1 any PR.  </div><div><br></div><div>Sam</div><div><br></div><div><div class="gmail_quote">On Thu, May 3, 2012 at 1:42 PM, Matthew Turk <span dir="ltr">&lt;<a href="mailto:matthewturk@gmail.com" target="_blank">matthewturk@gmail.com</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi all,<br>

<br>

I implemented this "quadtree extension" that duplicates the quadtree<br>

on all processors, which may make it nicer to scale projections.<br>

Previously the procedure was:<br>

<br>

1) Locally project<br>

2) Merge across procs:<br>

  2a) Serialize quadtree<br>

  2b) Point-to-point communicate<br>

  2c) Deserialize<br>

  2d) Merge local and remote<br>

  2e) Repeat from 2a<br>

3) Finish<br>

<br>

I've added a step 0) which is "initialize entire quadtree", which<br>

means all of step 2 becomes "perform sum of big array on all procs."<br>

This has good and bad elements: we're still doing a lot of heavy<br>

communication across processors, but it will be managed by the MPI<br>

implementation instead of by yt.  Also, we avoid all of the costly<br>

serialize/deserialize procedures.  So for a given dataset, step 0 will<br>

be fixed in cost, but step 1 will be reduced as the number of<br>

processors goes up.  Step 2, which now is a single (or two)<br>

communication steps, will increase in cost with increasing number of<br>

processors.<br>
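<br>
As a rough illustration of steps 0 through 2 in the new scheme, here is a pure-Python sketch with no MPI involved. It assumes the quadtree can be flattened to a fixed-size array that is identical on every processor; the names local_projection, allreduce_sum and demo are hypothetical and are not yt functions. In a real run the stand-in allreduce_sum would be a single collective such as mpi4py's comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM); the actual call yt makes is not shown in this message.<br>

```python
import random


def local_projection(cells, nbins):
    """Steps 0 and 1: start from a full-size zeroed array, then add local cells."""
    image = [0.0] * nbins
    for index, value in cells:
        image[index] += value
    return image


def allreduce_sum(local_images):
    """Step 2 stand-in for an MPI all-reduce with a sum operation.

    Every simulated rank receives the same elementwise total.
    """
    total = [sum(column) for column in zip(*local_images)]
    return [list(total) for _ in local_images]


def demo(nprocs=4, nbins=16, ncells=200, seed=0):
    rng = random.Random(seed)
    # Fake (bin index, value) pairs standing in for projected cell data.
    cells = [(rng.randrange(nbins), rng.random()) for _ in range(ncells)]
    # Round-robin decomposition of the cells over the simulated ranks.
    per_rank = [cells[rank::nprocs] for rank in range(nprocs)]
    local_images = [local_projection(chunk, nbins) for chunk in per_rank]
    reduced = allreduce_sum(local_images)
    # Reference answer: project everything on a single processor.
    serial = local_projection(cells, nbins)
    return reduced, serial


reduced, serial = demo()
print(max(abs(a - b) for a, b in zip(reduced[0], serial)))
```

The array every rank sums has the same length no matter how many ranks there are, which is why step 0 is fixed in cost while the local work in step 1 shrinks as ranks are added.<br>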

<br>

So, it's not clear whether this will *actually* be helpful.  It<br>

needs testing, and I've pushed it here:<br>

<br>

bb://MatthewTurk/yt/<br>

hash 3f39eb7bf468<br>

<br>

If anybody out there could test it, I'd be mighty glad.  This is the<br>

script I've been using:<br>

<br>

<a href="http://paste.yt-project.org/show/2343/" target="_blank">http://paste.yt-project.org/show/2343/</a><br>

<br>

I'd *greatly* appreciate testing results -- particularly for proc<br>

combos like 1, 2, 4, 8, 16, 32, 64, ... .  On my machine, the results<br>

are somewhat inconclusive.  Keep in mind you'll have to run with the<br>

option:<br>

<br>

--config serialize=False<br>

<br>

to get real results.  Here's the shell command I used:<br>

<br>

( for i in 1 2 3 4 5 6 7 8 9 10 ; do mpirun -np ${i} python2.7 proj.py<br>

--parallel --config serialize=False ; done ) 2>&1 | tee proj_new.log<br>

<br>

Comparison against results from the old method would also be super helpful.<br>

<br>

The alternate idea that I'd had was a bit different, but harder to<br>

implement, and also with a glaring problem.  The idea would be to<br>

serialize arrays, do the butterfly reduction, but instead of<br>

converting into data objects, simply progressively walk Hilbert<br>

indices.  Unfortunately this only works for up to 2^32 effective size,<br>

which is not going to work in a lot of cases.<br>
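<br>
For reference, a sketch of the kind of index such a walk would rely on: the standard iterative mapping from 2D cell coordinates to a position along a Hilbert curve. The names hilbert_index_2d and index_bits are hypothetical, and reading the 2^32 limit as a 64-bit integer constraint is an assumption, not something stated above or taken from yt.<br>

```python
def hilbert_index_2d(order, x, y):
    """Map cell (x, y) on a 2**order by 2**order grid to its Hilbert curve position."""
    n = 2 ** order
    d = 0
    s = n // 2
    while s:
        # Which quadrant of the current sub-square holds the cell?
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip so the next level sees a canonically oriented square.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def index_bits(order, ndim=2):
    """Bits needed to hold any Hilbert index on a grid with 2**order cells per side."""
    return order * ndim


# Walk a small 4 x 4 grid in Hilbert order.
order = 2
side = 2 ** order
cells = sorted(
    (hilbert_index_2d(order, x, y), (x, y))
    for x in range(side)
    for y in range(side)
)
print([cell for _, cell in cells])
print(index_bits(32))
```

On a 2D grid with 2^k cells per side the index needs 2k bits, so a 64-bit integer covers at most 2^32 cells per side, which would be consistent with the limit quoted above if indices are stored as 64-bit integers.<br>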

<br>

Anyway, if this doesn't work, I'd be eager to hear if anybody has any ideas.  :)<br>

<br>

-Matt<br>

_______________________________________________<br>

yt-dev mailing list<br>

<a href="mailto:yt-dev@lists.spacepope.org">yt-dev@lists.spacepope.org</a><br>

<a href="http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org" target="_blank">http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org</a><br>

</blockquote></div><br></div>