[Yt-dev] Parallelism notes and call for patch testing

Matthew Turk matthewturk at gmail.com
Fri Sep 24 22:43:56 PDT 2010


Hi all,

I prepared some strong scaling plots today (and I'm actually quite
pleased with them -- I will share them next week) and discovered a
couple items of interest.

 * Profile1D is substantially slower than Profile2D.  I'm currently
investigating why, but my current working theory is that the
pure-Python histogramming is what slows us down, compared with the
hand-coded C routines I wrote for the 2D and 3D histograms
(Bin[23]DProfile in yt/utilities/data_point_utilities.c).  I'll
probably rewrite the 1D version in C at some point.
 * Projections (old-style, not quadtree) scale much better than I had
realized, up to ~64 processors.  At 128 processors, on the dataset I
tested (512^3 L7), the algorithmic overhead and processor starvation
combined to reduce scaling substantially.  The good news is that it
still only takes 40 seconds on 64 processors.  This was on Triton,
with an 8x8 node topology.  I'm confident that it would scale further
on bigger datasets.
 * With the coalescing of grid/CPU reading in the yt-Enzo code, IO is
not really an issue at the moment.
 * DerivedQuantities use _mpi_catlist, which I wrote something like
two and a half years ago.  At the time, I avoided collective
communications and non-blocking combines, so it works by pickling
lists and passing them over the wire, one by one, to the root
processor, where they are joined and then broadcast back.  This is
extremely slow.  The mechanism that now makes the most sense to me is
to use the _recv_arrays routine, which manages an alltoallv call.
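To make the Profile1D point above concrete, here's a minimal sketch (not yt's actual code; the function names are mine, and np.histogram stands in for the hand-coded C routine) contrasting an element-at-a-time Python binning loop with a vectorized equivalent:

```python
import numpy as np

def bin_profile_pure_python(values, weights, bin_edges):
    """Pure-Python binning loop, analogous to what slows Profile1D down."""
    counts = np.zeros(len(bin_edges) - 1)
    for v, w in zip(values, weights):
        # Search for the half-open bin [edge_i, edge_{i+1}) one value at a time.
        for i in range(len(bin_edges) - 1):
            if bin_edges[i] <= v < bin_edges[i + 1]:
                counts[i] += w
                break
    return counts

def bin_profile_vectorized(values, weights, bin_edges):
    """Vectorized binning, standing in for the C Bin1DProfile we don't have yet."""
    counts, _ = np.histogram(values, bins=bin_edges, weights=weights)
    return counts

rng = np.random.default_rng(0)
vals = rng.random(1000)
wts = rng.random(1000)
edges = np.linspace(0.0, 1.0, 17)
assert np.allclose(bin_profile_pure_python(vals, wts, edges),
                   bin_profile_vectorized(vals, wts, edges))
```

The two agree bin for bin, but the pure-Python version pays interpreter overhead on every element, which is exactly the cost the C routines avoid.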

I've prepared two pastes.  The first is a patch:

http://paste.enzotools.org/show/1242/

that converts the DerivedQuantities parallel join to _mpi_catarrays
and converts _mpi_catarrays to use the alltoallv wrapper.
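For anyone who wants a rough mental model of what the alltoallv-style join does, here is a single-process simulation of the counts-and-displacements bookkeeping (hypothetical names, and no actual MPI; the real patch goes through yt's wrapper, where each rank's chunk would arrive via the alltoallv call rather than a Python loop):

```python
import numpy as np

def catarrays_alltoallv_model(per_rank_arrays):
    """Simulate an alltoallv-style concatenation: each 'rank' contributes
    its array, and the full concatenation is rebuilt from per-rank counts
    and displacements, as the receive side of an alltoallv would do."""
    counts = np.array([a.size for a in per_rank_arrays])
    # Offset of each rank's chunk within the receive buffer.
    displs = np.concatenate(([0], np.cumsum(counts)[:-1]))
    recv = np.empty(counts.sum(), dtype=per_rank_arrays[0].dtype)
    for rank, chunk in enumerate(per_rank_arrays):
        recv[displs[rank]:displs[rank] + counts[rank]] = chunk
    return recv

chunks = [np.arange(3), np.arange(3, 5), np.arange(5, 9)]
joined = catarrays_alltoallv_model(chunks)
# joined is [0 1 2 3 4 5 6 7 8]
```

The key difference from the old _mpi_catlist path is that nothing gets pickled: the arrays move as contiguous buffers in one collective, instead of list-by-list hops through the root processor.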

The second is the test script I used:

http://paste.enzotools.org/show/1243/

which I was only able to test on my laptop this evening, as Triton's
disk is down for a bit.  It produced bitwise-identical results (once I
handled the transposes correctly!) between parallel and serial runs.
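For reference, a check like the one above can be made stricter than np.allclose by comparing raw bytes; this is a sketch (the helper name is mine, not from the test script), which even distinguishes 0.0 from -0.0:

```python
import numpy as np

def bitwise_identical(a, b):
    """True only if two arrays match bit for bit, not merely to within
    floating-point tolerance."""
    a, b = np.asarray(a), np.asarray(b)
    return (a.shape == b.shape
            and a.dtype == b.dtype
            and a.tobytes() == b.tobytes())

assert bitwise_identical(np.array([1.0, 2.0]), np.array([1.0, 2.0]))
# 0.0 and -0.0 compare equal numerically but differ in the sign bit:
assert not bitwise_identical(np.array([0.0]), np.array([-0.0]))
```

A tolerance-based comparison would have hidden the transpose bug's near-miss cases, so a byte-level check is the right bar for "bitwise identical."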

If anyone who uses parallelism in yt heavily has a chance, I'd really
appreciate it if you could run this script on a dataset you've got.
If you've got other big parallel jobs that use the DerivedQuantities
or Slice mechanisms (those are the ones that have been touched), it
would also be great to hear back about how they fare.

Thanks!

Matt
