[Yt-dev] Projection speed improvement patch

Matthew Turk matthewturk at gmail.com
Sun Nov 8 15:05:53 PST 2009


Hi John,

> yt was using slightly more memory on ranger at 2.1GB/core, which isn't bad
> at all.  This pushed me over the 2GB/core limit on ranger, so I had to use 8
> cores/node instead of 16.

Oh, hm, interesting...

> However, it was slower by a factor of 2.5.  It took 1084 seconds from start
> to finish (including all of the overhead).  I had already created a binary
> hierarchy beforehand.  Ranger in general is slow (I suspect its
> interconnect), so maybe it's just a "feature" of ranger.

Okay.  So that is something that I wonder if we can improve --
particularly since you're already seeing that you need more cores to
run anyway.  Right now, the mechanism for reading data goes something
like this:

Projection:
For each level:
    identify grids on this level
    read all grids
        for file in all_files_for_these_grids:
            H5Fopen(File)
            for each grid in this file: H5Dread(each data set for this grid)

So for each file that appears on a given level, the corresponding CPU
file is only H5Fopen'd once -- which, with large Lustre systems,
should help out.  (However, it does do multiple, potentially very
small, H5Dreads -- but I think we might be able to coalesce these the
same way enzo (optionally) can, with the H5P_DATASET_XFER property
list, since we're reading into void*'s that persist through the
entirety of the C function.)
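
To make that concrete, here's the rough shape of what I have in mind,
in plain HDF5 C calls.  This is just a sketch -- read_grids_from_file,
grid_dsets, and buffers are made-up names, the real loop lives in
HDF5LightReader.c against the 1.6-style API, and the H5Pset_buffer call
is only my guess at the knob enzo exposes; I haven't verified it buys
anything on Lustre:

#include <hdf5.h>

/* Open one CPU file, reuse a single H5P_DATASET_XFER property list,
 * and issue all of the (possibly very small) H5Dreads for the grids
 * that live in that file.  buffers[] are the void*'s that persist for
 * the entire C call. */
static int read_grids_from_file(const char *fname,
                                const char **grid_dsets,
                                int n_grids, void **buffers)
{
    hid_t file_id, dset_id, xfer;
    int i;

    file_id = H5Fopen(fname, H5F_ACC_RDONLY, H5P_DEFAULT);
    if (file_id < 0) return -1;

    xfer = H5Pcreate(H5P_DATASET_XFER);
    /* Larger type-conversion buffer (4 MB is an arbitrary guess);
     * this only matters when the read involves a type conversion. */
    H5Pset_buffer(xfer, 4 * 1024 * 1024, NULL, NULL);

    for (i = 0; i < n_grids; i++) {
        dset_id = H5Dopen(file_id, grid_dsets[i]);  /* 1.6-style H5Dopen */
        if (dset_id < 0) continue;
        H5Dread(dset_id, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL,
                xfer, buffers[i]);                  /* assumes 64-bit data */
        H5Dclose(dset_id);
    }

    H5Pclose(xfer);
    H5Fclose(file_id);
    return 0;
}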

However, one other option would be to allow the projections to preload
the entire dataset, rather than just the files needed for that level.
If we assume complete grid locality, then our level-by-level approach
*could* have roughly (N_enzo_cpus)/(N_yt_cpus) * N_levels H5Fopens,
but it could be a lot worse with the standard enzo load balancing.
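(For a made-up concrete case: 512 enzo CPU files, 16 yt processes, and
8 levels works out to roughly (512/16) * 8 = 256 H5Fopens per process
under ideal locality; with scattered load balancing it heads toward
every file on every level.)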

The projections parallelize by 2D domain decomposition, and they define
their regions right away.  So if we were to preload the entire dataset,
rather than going level-by-level, we'd use more memory but we'd have
fewer H5Fopen calls (which, again, I'm told are the most expensive part
of Lustre data access).

I've created a patch that handles both of these things, and it's in
the hierarchy-opt branch as hash 801b378a22f7.  Because this touches
the C code, it requires re-running the install or develop step
(depending on how you installed the first time).  If you get a chance,
could you let me know if this improves things?  I think tuning the
buffer size inside yt/lagos/HDF5LightReader.c may help as well.

(I was unable to test if this worked any better, as triton was down...)

You can change the mechanism for preloading by setting the argument
preload_style to either "level" (currently the default) or "all"
(where it loads the entire source that it "owns").  This can be passed
through the call to add_projection:

pc.add_projection("Density", 0, preload_style='all')

> Somewhat related but -- The Alltoallv call was failing when I compiled
> mpi4py with openmpi, but this went away when I compiled it with mvapich.  If

Dang it.  This looks like I'm just passing around arrays that are too
big.  I think I might need some help from other people on the right
way to do this...  Ideas, anybody?

-Matt


