[yt-dev] Call for testing: Projection performance

Matthew Turk matthewturk at gmail.com
Thu May 3 12:42:17 PDT 2012


Hi all,

I've implemented a "quadtree extension" that duplicates the quadtree
on all processors, which may help projections scale better.
Previously the procedure was (a rough sketch of the merge loop
follows the list):

1) Locally project
2) Merge across procs:
  2a) Serialize quadtree
  2b) Point-to-point communicate
  2c) Deserialize
  2d) Merge local and remote
  2e) Repeat from 2a
3) Finish
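
For the curious, here is a minimal sketch of what that merge loop
looks like in spirit -- not the actual yt code; it assumes mpi4py and
stand-in serialize/deserialize/merge helpers:

# Rough sketch of the old merge step, not the actual yt code.
# Assumes mpi4py; serialize, deserialize, and merge_trees are
# stand-ins for the quadtree routines in yt.
from mpi4py import MPI

def merge_across_procs(comm, local_tree, serialize, deserialize,
                       merge_trees):
    # Binary-tree reduction: pair up ranks, ship serialized quadtrees
    # point-to-point, and merge until rank 0 holds the full result.
    rank, size = comm.rank, comm.size
    step = 1
    while step < size:
        if rank % (2 * step) == 0:
            src = rank + step
            if src < size:
                buf = comm.recv(source=src, tag=0)    # 2b) communicate
                remote = deserialize(buf)             # 2c) deserialize
                local_tree = merge_trees(local_tree, remote)  # 2d) merge
        else:
            # 2a/2b) serialize and send to the partner rank, then drop out
            comm.send(serialize(local_tree), dest=rank - step, tag=0)
            break
        step *= 2
    return local_tree   # complete only on rank 0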

I've added a step 0) which is "initialize entire quadtree", which
means all of step 2 becomes "perform sum of big array on all procs."
This has good and bad elements: we're still doing a lot of heavy
communication across processors, but it will be managed by the MPI
implementation instead of by yt.  Also, we avoid all of the costly
serialize/deserialize procedures.  So for a given dataset, step 0 will
be fixed in cost, but step 1 will be reduced as the number of
processors goes up.  Step 2, which is now just one (or two)
communication steps, will increase in cost as the number of
processors grows.
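
Concretely, once every rank holds an identically-structured tree,
step 2 is just an element-wise sum of the per-node arrays.  A minimal
sketch with mpi4py (the array names here are stand-ins, not the
actual yt internals):

# Minimal sketch of the new "step 2", assuming every rank holds the
# same tree layout.  `values` and `weights` are stand-ins for the
# flattened per-node NumPy arrays, identically ordered on all ranks.
import numpy as np
from mpi4py import MPI

def reduce_projection(comm, values, weights):
    # Sum the per-node arrays across all ranks, in place.
    comm.Allreduce(MPI.IN_PLACE, values, op=MPI.SUM)
    comm.Allreduce(MPI.IN_PLACE, weights, op=MPI.SUM)
    return values, weights

# Toy usage: 16 tree nodes, each rank contributes its own rank number.
comm = MPI.COMM_WORLD
values = np.zeros(16, dtype="float64")
weights = np.zeros(16, dtype="float64")
values[:] = comm.rank
reduce_projection(comm, values, weights)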

So, it's not clear whether this will *actually* be helpful.  It
needs testing, and I've pushed it here:

bb://MatthewTurk/yt/
hash 3f39eb7bf468

If anybody out there could test it, I'd be mighty glad.  This is the
script I've been using:

http://paste.yt-project.org/show/2343/
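
(If the paste ever goes away: all you really need is a timed
projection.  A minimal stand-in, assuming the yt 2.x API and a
placeholder dataset path you'd swap for your own:)

# Minimal stand-in timing script, assuming the yt 2.x API.
# Replace the dataset path with one of your own.
import time
from yt.mods import *

pf = load("/path/to/your/dataset")   # placeholder path
t0 = time.time()
proj = pf.h.proj(0, "Density")       # unweighted projection along x
print "Projection took %0.3e seconds" % (time.time() - t0)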

I'd *greatly* appreciate testing results -- particularly for proc
combos like 1, 2, 4, 8, 16, 32, 64, ... .  On my machine, the results
are somewhat inconclusive.  Keep in mind you'll have to run with the
option:

--config serialize=False

to get real results; otherwise yt will store the projection in the
.yt file and simply re-use it on subsequent runs.  Here's the shell
command I used:

( for i in 1 2 3 4 5 6 7 8 9 10 ; do mpirun -np ${i} python2.7 proj.py
--parallel --config serialize=False ; done ) 2>&1 | tee proj_new.log

Comparison against results from the old method would also be super helpful.

The alternate idea I'd had was harder to implement and has a glaring
problem.  It would be to serialize arrays and do the butterfly
reduction, but instead of converting into data objects, simply walk
Hilbert indices progressively.  Unfortunately that only works up to
an effective size of 2^32, which rules out a lot of cases.
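
(The 2^32 figure comes from key width, I believe: a 2D Hilbert key at
effective resolution R per side needs 2*log2(R) bits, so 64-bit
integer keys run out right at R = 2^32.  Quick sanity check, just
illustrative arithmetic rather than yt code:)

# Illustrative arithmetic only, not yt code: bits needed to store a
# 2D Hilbert/Morton key covering an effective_size^2 grid.
def key_bits(effective_size, ndims=2):
    # each dimension needs enough bits to index 0 .. effective_size-1
    return ndims * (effective_size - 1).bit_length()

print key_bits(2**32)   # 64 -- just fits in a 64-bit integer key
print key_bits(2**33)   # 66 -- does not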

Anyway, if this doesn't work, I'd be eager to hear if anybody has any ideas.  :)

-Matt


