[yt-dev] halo finding overhaul

Matthew Turk matthewturk at gmail.com
Thu Nov 8 19:13:48 PST 2012


Hi Stephen,

Thanks for writing and summarizing this!

On Thu, Nov 8, 2012 at 7:38 PM, Stephen Skory <s at skory.us> wrote:
> Hi everyone,
>
> Matt, Chris Moody, and I have started a bit of a conversation
> off-list about overhauling the halo finding machinery that we would
> like to bring here. This has been sparked by some recent work on the
> Rockstar integration into yt. Matt will surely be able to summarize
> these issues better, but here's how I understand things:
>
> - As Matt wrote the halo finding stuff years ago, it isn't very
> scalable to high levels of parallelism. In particular, halo object
> information isn't shared across parallel processes as well as it
> should be. For some of the halo finders, data is localized to the processor
> that "owns" the halo. When running in parallel, the local values need
> to be broadcast to all the other processors when accessed. This is why
> you get the halo mass with a method call "halo.total_mass()" instead of a
> simple attribute "halo.total_mass". The "total_mass()" function (and
> other halo object functions) is wrapped with a decorator that
> broadcasts the value transparently.

Yes, that's right.  This was done for what I thought, five years ago,
were very good reasons, but in retrospect they were mistaken.  If you
conceptually break the process of finding halos into a few distinct
steps, they are roughly:

 * Find the halos
 * Return halos to script/user
 * Analyze each halo individually
 * Synthesize the results

What this looks like for yt right now is roughly:

halos = HaloFinder(pf)
all_masses = []
for halo in halos:
    all_masses.append(halo.total_mass())

# ... then do something with all_masses

Now, if we actually want to do something really in-depth to each
individual halo, this design makes sense.  But usually, we don't!  In
fact, for the most part, all of the things we want to do to a halo can
be enumerated in advance, and really, that list isn't terribly long.
Additionally, the halos all carry their particles around.  But we
don't pass those particles around -- which means that (as Stephen
noted) we analyze each halo sequentially.

But there are a few operations for which we *do* want to do
cross-correlation between halos.  For those, we will want to keep
particles in memory.  But those items are relatively straightforward
to add later.

So something like this probably won't work at all:

halos = HaloFinder(pf)
for halo in parallel_objects(halos):
    mass = halo.total_mass()

...because what's going on is that Proc1 might be waiting for a
message from Proc2 for the mass of Halo403, but it won't show up,
because Proc2 is busy waiting for Proc3 to send Halo343.
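
To make that concrete, here is a minimal sketch of the broadcast
decorator Stephen described, written with mpi4py.  The names
(broadcast_value, halo.owner) are hypothetical, not yt's actual
implementation:

from mpi4py import MPI

comm = MPI.COMM_WORLD

def broadcast_value(func):
    # Only the processor that "owns" the halo computes the value;
    # every other processor receives it via a collective broadcast.
    def wrapper(halo):
        value = func(halo) if halo.owner == comm.rank else None
        # bcast() is collective: every rank must call it for the *same*
        # halo, in the same order.  Under parallel_objects(), each rank
        # iterates over a *different* subset of halos, so the collectives
        # mismatch and every rank blocks forever.
        return comm.bcast(value, root=halo.owner)
    return wrapper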

In principle, though, the big issue is that halos are one of the few
places left in yt where we don't encourage scripts to treat data as a
*stream*.  With .quantities[], projections, slices, all of that -- we
ask scripts to focus on data *not* persisting, and on being
parallelizable.  With halos, we are asking people to try to keep the
particles around.

So I would very much like to precompute the vast majority of items,
and return *lightweight*, *data-free* halos back to the scripts.  I
think that nearly all, if not all, operations that are done on halos
can either be pre-computed or are rare enough that dumping data to
disk and loading it later is viable.
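
As a sketch of what that could look like -- the field names here are
illustrative, not a proposed API:

from collections import namedtuple

# Every quantity is computed while the particles are still in memory;
# the record handed back to the script holds only the results.
LightweightHalo = namedtuple(
    "LightweightHalo",
    ["halo_id", "total_mass", "center_of_mass", "max_radius"])

halo = LightweightHalo(403, 1.2e14, (0.51, 0.48, 0.52), 0.01)
print(halo.total_mass)  # plain attribute: no particles, no communication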

>
> I think it should be fairly straightforward to ensure that all halo
> objects on all tasks know all the key bits of info, but it's not clear
> to me how particle access would work.

I think we want to discourage particle access.  In fact, I think we
should discard particles from memory after the halos have been
identified.

>
> - Matt has also raised the idea of creating "callbacks" for halo
> finders. I don't understand this quite as well, but I think it means
> that it would be possible to write functions that would operate on the
> particles as the halo finder runs (or after it has finished finding
> halos but before it returns) to ensure good parallelism and speed. He has also
> mentioned that it would be possible for these callbacks to pull grid
> information from Python if that was needed for the analysis.

You've basically got the right idea.  The idea is that you supply a
series of functions for the quantities you want to measure, much like
you would a derived field, and these get called on each halo.  So for
instance:

def center_of_mass(pf, particle_data):
    ...

halos = HaloFinder(pf, ..., callbacks=[center_of_mass, ...])

Then you'd get it back out.  Of course, we'd supply default ones.  And
this would also be a vehicle for writing the particles out to disk, as
you could simply supply a particle output callback.

In this way, we'd keep the particles *only as long as we need to*, and
then discard them, but we'd still have full access.
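
Fleshing out the sketch above -- assuming, hypothetically, that each
callback receives the particles as a dict of NumPy arrays keyed by
field name -- a pair of callbacks might look like:

import numpy as np

def center_of_mass(pf, particle_data):
    # Mass-weighted mean position of the halo's particles.
    m = particle_data["particle_mass"]
    return np.array([np.average(particle_data[ax], weights=m)
                     for ax in ("particle_position_x",
                                "particle_position_y",
                                "particle_position_z")])

def dump_particles(pf, particle_data):
    # The particle-output callback mentioned above: write the raw
    # particles to disk so rare, in-depth analyses can reload them
    # later.  (A real version would need a per-halo filename.)
    np.savez("halo_particles.npz", **particle_data)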

>
> What I don't quite understand here is where these callbacks would
> actually operate. In C for HOP, in Cython for Rockstar, and in Python
> for parallel HOP? Or would they happen between what the halo finder
> returns and building the halo lists?

They'd happen between the halo finder returning and building the
lists, and I would say any callable could be supplied.
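
Roughly, the driver could look like this sketch (all names
hypothetical):

def run_callbacks(pf, raw_halos, callbacks):
    # Runs after the finder returns and before the halo list is built:
    # each callback sees the particles while they are still in memory,
    # and only the measured values are kept.
    results = []
    for raw_halo in raw_halos:
        particle_data = raw_halo.particle_data
        measured = {cb.__name__: cb(pf, particle_data)
                    for cb in callbacks}
        raw_halo.particle_data = None  # drop the particles immediately
        results.append(measured)
    return results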

Does this make sense?

Mainly, I want to get rid of the over-generality that requires
retaining a large amount of data, blows our chances at scalability,
and is probably so rarely used that its few real uses could be handled
by the approach we can support here.

-Matt

>
> I probably have more things to say and ask, but I think I'll let Matt come in.
>
> --
> Stephen Skory
> s at skory.us
> http://stephenskory.com/
> 510.621.3687 (google voice)


