[yt-dev] Pre-YTEP proposal: eliminate a global mesh for particles

Matthew Turk matthewturk at gmail.com
Fri May 27 16:05:02 PDT 2016


Hi all,

As some of you know, Meagan Lang and I have been working on a new
system for indexing particles quickly.  I'm writing an email today as
a pre-YTEP phase of getting feedback before moving forward on a plan
of action.

As of yt 3.3 (dev), the way particles work is as follows:

 * During ingest, build a global octree mesh.  (This is problem #0.)
This requires loading a uint64 into memory for every single particle,
sorting those values, and then building an octree.  The octree
construction is pretty fast; the sorting is one of the slower parts,
and a uint64 for every particle is a pretty bad limiter.  (1024^3 =>
8 gigs for the array, then sorting.)
 * A `regions` object is stored on the index, with dimensions
(min(N_data_files, 256))^3.  This stores a bitmask of which files fall
into each cell of that conceptual unigrid object (a toy sketch of this
lookup follows the list).
 * The `regions` object gets queried whenever an object is selected,
and then it gets used to identify which data files should be loaded.
 * Whenever smoothing happens, 100% of the particles from the files
(for the data object in question) are loaded into memory at once.
These are then queried through a "gather" method: each mesh point
being evaluated finds its neighbor particles and smooths them onto
itself.  If you make a projection of the full dataset with a standard
`ProjectionPlot`, all of these things happen.
 * Parallelism does not help SPH datasets for any smoothing operation
or any `spatial` chunk operation.  Parallelism does not help the
indexing operation.
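
To make the `regions` bitmask concrete, here is a toy sketch of that
lookup (not the actual yt code; the cell count, the uint8 masks, and
the function names are all illustrative):

    import numpy as np

    ncells = 4         # the real index uses min(N_data_files, 256) per axis
    nfiles = 3
    # One bitmask per cell of the conceptual unigrid; bit i is set if data
    # file i has at least one particle in that cell.
    regions = np.zeros((ncells, ncells, ncells), dtype=np.uint8)

    def mark_file(positions, file_id):
        """Record which cells contain particles from this file ([0, 1)^3 domain)."""
        idx = np.clip((positions * ncells).astype(np.int64), 0, ncells - 1)
        regions[idx[:, 0], idx[:, 1], idx[:, 2]] |= np.uint8(1 << file_id)

    def files_for_box(lo, hi):
        """Which files might contribute particles to a rectangular selection?"""
        i0 = np.clip((np.asarray(lo) * ncells).astype(int), 0, ncells - 1)
        i1 = np.clip((np.asarray(hi) * ncells).astype(int), 0, ncells - 1)
        sub = regions[i0[0]:i1[0] + 1, i0[1]:i1[1] + 1, i0[2]:i1[2] + 1]
        combined = np.bitwise_or.reduce(sub, axis=None)
        return [f for f in range(nfiles) if combined & (1 << f)]

    np.random.seed(0)
    for fid in range(nfiles):
        mark_file(np.random.random((1000, 3)) * 0.5 + 0.25 * fid, fid)
    print(files_for_box([0.0, 0.0, 0.0], [0.3, 0.3, 0.3]))

The real index is of course more careful (and much bigger), but the
query pattern is the same.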

This is problematic in a number of ways; the biggest and most obvious
is that every time the dataset is loaded, a mesh octree is built,
which takes up a ton of RAM and only gets used for smoothing and
deposition operations.  And it's slow.

And one thing I want to make very clear at this point: using it in
this way is not ideal.  This mesh octree is supposed to provide an
estimate of the density in a given location; then we smooth onto that.
This preserves the "project once, reuse many" idea that we've had in
yt in the past.  (For that particular purpose, it's fine.)  But, since
we use an MX Octree, which by its nature is not balanced by particle
count, we end up with empty zones.  Furthermore, during smoothing
operations, we can run into cases where we under-find particles.
(This can produce poor smoothing results; one fix I'm exploring is
using a balanced structure like a kD-tree, which is the obvious
solution I'm embarrassed I haven't explored before.)
We aren't being as true to the data as we should be when we do this.
Plus, if we do this for projections, we're making a *ton* more work
than we need to.  We build an octree, smooth onto the octree, then
immediately collapse it to a quad tree.  (Yes, really.)

Meagan has been pushing forward work to implement a fast,
memory-conservative indexing system.  It works, and it works really
well, for indexing specific regions of the fileset that makes up a
dataset.  The idea is to develop a hierarchical indexing system that
does not presuppose there is any sorting of particles within the
files, but which can benefit from such a sorting.  It creates two
levels of compressed bitmap indices, one for each data file (or
sub-chunk of a data file), where the second level is sparse and only
exists in regions where more than one data file has particles.
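
To illustrate the shape of that two-level scheme (plain Python sets
stand in for the compressed bitmaps here, the cell indexing is
simplified, and the function names are mine, not Meagan's):

    import numpy as np

    def cells(pos, order):
        """Flat index of the cell containing each position in [0, 1)^3."""
        n = 1 << order
        ijk = np.clip((pos * n).astype(np.int64), 0, n - 1)
        return (ijk[:, 0] * n + ijk[:, 1]) * n + ijk[:, 2]

    def build_index(file_positions, order1=4, order2=7):
        # Level 1: for each file, the coarse cells it touches (dense coverage).
        level1 = {f: set(cells(p, order1)) for f, p in file_positions.items()}
        # Collision cells: coarse cells touched by more than one file.
        counts = {}
        for s in level1.values():
            for c in s:
                counts[c] = counts.get(c, 0) + 1
        collisions = np.array(sorted(c for c, n in counts.items() if n > 1))
        # Level 2: refined cells, built *only* inside collision cells (sparse).
        level2 = {}
        for f, p in file_positions.items():
            in_coll = np.isin(cells(p, order1), collisions)
            level2[f] = set(cells(p[in_coll], order2))
        return level1, level2, collisions

    np.random.seed(1)
    files = {f: np.random.random((2000, 3)) for f in range(4)}
    l1, l2, coll = build_index(files)
    print(len(coll), sum(len(s) for s in l2.values()))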

What we've discovered when attempting to apply this to smoothing is
that the biggest problem comes from the notion that there is a single,
fiducial "mesh" that exists everywhere.  What I mean by that is that
if such a mesh existed, we could chunk over each file in whatever
order we want, applying particles and smoothing them and whatnot.
Generating such a mesh in a memory efficient way is very difficult,
and we have yet to identify a mechanism by which we can do that
without resulting in somewhere between N and N^2 reads of files for
every smoothing operation.  Either we have a single, global mesh
(which, unless we're wrong, requires either multiple passes over files
until convergence *or* a full in-memory set of particles) or we do
lots of over-reads to make sure that the mesh isn't dependent on which
file we are reading and when.  An issue that comes up immediately is
that if we don't have a fiducial mesh, then depositing or smoothing a
particle will give different results (which cannot be combined in a
well-defined way) depending on the order in which it is done.

What Meagan's built gives us a fast index, as well as an estimate of
where regions of high particle count can be found; it further can give
us a really good idea of how to build a quadtree that over-resolves
structure.  (Comes in handy below.)

I propose we eliminate the notion of a fixed global mesh for particles.

There are a few places in the code that *right now* assume a global mesh:

 * Projection and slice plots of "smoothed" quantities
 * Smoothing operations that are not on user-defined (`arbitrary_grid`) meshes
 * Volume renderings (which don't really work anyway)
 * Ray tracing individual rays (particularly important for light ray
and trident)

I'll address each of these individually, but I want to start with this
statement: we should not make assumptions about global meshes, but we
should make such assumptions and decisions *available* to the people
using the code.  This means, for instance, that we should continue to
implement and make available smoothing operations, but expose them
through an `arbitrary_grid` object.
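
For instance, the kind of explicit opt-in I have in mind looks roughly
like this (the filename is a placeholder and the exact field tuple
depends on the frontend, but `arbitrary_grid` itself already exists):

    import yt

    ds = yt.load("snapshot_033.hdf5")   # placeholder SPH dataset

    # Explicitly opt in to a fixed-resolution mesh over a sub-volume, rather
    # than relying on a global octree built at index time.
    ag = ds.arbitrary_grid([0.25, 0.25, 0.25], [0.75, 0.75, 0.75],
                           dims=[128, 128, 128])

    # Smoothed fields are then evaluated on that grid on demand.
    rho = ag["gas", "density"]
    print(rho.shape)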

Projections and slices of non-local particle quantities (i.e.,
density, temperature, etc, of gas particles) are presently handled by
smoothing onto a mesh.  We should instead be doing some method of
pixelizing these directly, either onto a quadtree that can be reused,
or onto a buffer directly.  Ting-Wai To, an undergrad at Illinois, has
built an implementation in Cython of a flat pixelizer that's
reasonably fast.  What I would propose is using the quad tree mesh
found during the indexing step, and then smoothing the particles
onto this mesh directly.  This means we can use *just* an `io` chunk,
rather than (implicitly) a `spatial` chunk for SPH particles when
smoothing them.  Integrating the overall SPH smoothing into the
visualization layer, rather than the data access layer, also means
that we can rotate particles more easily.
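
To show what I mean by pixelizing directly, here is a bare-bones
scatter-style deposit onto a flat buffer (this is not Ting-Wai's
Cython code; the kernel normalization is omitted and everything is
pure NumPy):

    import numpy as np

    def kernel2d(q):
        """Cubic spline kernel shape (un-normalized), support q < 1."""
        return np.where(q < 0.5, 1 - 6 * q**2 + 6 * q**3,
                        np.where(q < 1.0, 2 * (1 - q)**3, 0.0))

    def scatter_pixelize(px, py, hsml, value, bounds, nx, ny):
        """Loop over particles and splat each one's kernel footprint onto an
        (ny, nx) image buffer (the 'scatter' approach to SPH pixelization)."""
        x0, x1, y0, y1 = bounds
        dx, dy = (x1 - x0) / nx, (y1 - y0) / ny
        buf = np.zeros((ny, nx))
        for x, y, h, v in zip(px, py, hsml, value):
            i0, i1 = max(int((x - h - x0) / dx), 0), min(int((x + h - x0) / dx) + 1, nx)
            j0, j1 = max(int((y - h - y0) / dy), 0), min(int((y + h - y0) / dy) + 1, ny)
            if i0 >= i1 or j0 >= j1:
                continue
            xs = x0 + (np.arange(i0, i1) + 0.5) * dx
            ys = y0 + (np.arange(j0, j1) + 0.5) * dy
            r = np.hypot(xs[None, :] - x, ys[:, None] - y)
            buf[j0:j1, i0:i1] += v * kernel2d(r / h)
        return buf

    np.random.seed(2)
    img = scatter_pixelize(np.random.random(500), np.random.random(500),
                           np.full(500, 0.05), np.ones(500),
                           (0.0, 1.0, 0.0, 1.0), 256, 256)
    print(img.max())

Swapping the regular pixel grid here for the variable-resolution quad
tree cells found during indexing doesn't change the per-particle
scatter step.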

Smoothing operations that are not on user-defined meshes, such as
computing the total density inside a sphere, are currently using the
same method as above.  There are two main uses of this: computing some
quantity and then taking a radial profile of it, and computing a
density (or similar) and comparing it to another quantity.  However,
it is becoming increasingly clear that this is not what most people
using SPH codes *want*.  It is not a particularly bad approximation as
long as the resolution is high enough, but it's also not particularly
good, and it is lossy.  Instead, we should be regarding the particles
as particles, and computing profiles on *those*.  This could cause
issues in regimes where a particle crosses bin boundaries when the
mesh point doesn't, but I would assert that this is an ill-posed
operation to begin with.
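
As a concrete (if minimal) example of profiling the particles
directly, with no intermediate mesh at all:

    import numpy as np

    def radial_profile(pos, mass, temp, center, rmax, nbins=32):
        """Mass-weighted temperature profile computed straight from particles."""
        r = np.linalg.norm(pos - center, axis=1)
        edges = np.linspace(0.0, rmax, nbins + 1)
        m_bin = np.histogram(r, bins=edges, weights=mass)[0]
        mt_bin = np.histogram(r, bins=edges, weights=mass * temp)[0]
        with np.errstate(invalid="ignore", divide="ignore"):
            prof = mt_bin / m_bin       # NaN in empty bins
        return 0.5 * (edges[1:] + edges[:-1]), prof

    np.random.seed(3)
    pos = np.random.normal(0.5, 0.1, size=(10000, 3))
    centers, T_of_r = radial_profile(pos, np.ones(10000),
                                     1.0 + np.random.random(10000),
                                     np.array([0.5, 0.5, 0.5]), 0.5)
    print(T_of_r[:5])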

Volume rendering at present is problematic.  An octree is really the
only way we *could* do things, but it's also not terribly good for it.
We'd be smoothing, volume rendering, then visualizing.  The only
reason we do it this way is because we only have a VolumeSource, and
no SmoothedParticleSource.  We should instead build a
SmoothedParticleSource, which I would wager would be considerably
faster and easier to implement than forcing the issue in the opposite
direction.  There are remnants of this in the code from star
particle rendering which could be used in this case.
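
For concreteness, here is an entirely hypothetical sketch of what a
particle-native source could do: depth-sort along the viewing axis and
composite splats front to back.  It deliberately ignores yt's actual
scene/camera API, uses a Gaussian in place of a real SPH kernel, and
the class name simply echoes the proposal above:

    import numpy as np

    class SmoothedParticleSource:
        """Hypothetical sketch: depth-sort particles along the view axis and
        alpha-composite Gaussian splats front to back."""

        def __init__(self, pos, hsml, emission, opacity):
            self.pos, self.hsml = pos, hsml
            self.emission, self.opacity = emission, opacity

        def render(self, npix=128, axis=2):
            img = np.zeros((npix, npix))
            trans = np.ones((npix, npix))            # transmission reaching the eye
            xa, ya = [a for a in range(3) if a != axis]
            xs = (np.arange(npix) + 0.5) / npix      # pixel centers in [0, 1)
            for p in np.argsort(self.pos[:, axis]):  # front to back along `axis`
                dx = xs[None, :] - self.pos[p, xa]
                dy = xs[:, None] - self.pos[p, ya]
                w = np.exp(-0.5 * (dx**2 + dy**2) / self.hsml[p]**2)
                img += trans * self.emission[p] * w
                trans *= np.exp(-self.opacity[p] * w)
            return img

    np.random.seed(4)
    n = 200
    src = SmoothedParticleSource(np.random.random((n, 3)), np.full(n, 0.05),
                                 np.ones(n), np.full(n, 0.5))
    print(src.render().max())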

Finally, the case I'm most worried about is ray tracing of individual
rays.  We need to ensure that we are able to support the needs of
communities that build on yt functionality, such as trident.  I
believe, though I may be incorrect, that there are two modes we
need to support here.  The first is specific query points, and the
other is aggregate sums.  Query points are where you evaluate the
value of some quantity at multiple points along a ray running from A
.. B, and aggregates are where you evaluate the sum total of some
value along the ray.

In the former case, for query points, we actually can rely on existing
functionality rather nicely.  We would be able to specify points (or
even implicitly choose them to be at some regular interval or
something), evaluate the smoothing kernel at those points, and return
those values.  This is identical to how it operates now, except that
the points would need to be specified in some manner rather than
simply being the locations where the ray crosses oct cells.  I
believe this would improve the fidelity of the calculation.
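
As a sketch of the query-point mode (a brute-force gather over all
particles at user-chosen sample points; a real version would use the
index to restrict which particles are even considered):

    import numpy as np

    def w3d(q):
        """Cubic spline kernel, 3D normalization for smoothing length h = 1."""
        shape = np.where(q < 0.5, 1 - 6 * q**2 + 6 * q**3,
                         np.where(q < 1.0, 2 * (1 - q)**3, 0.0))
        return (8.0 / np.pi) * shape

    def sample_along_ray(start, end, nsamples, pos, hsml, mass, field):
        """Evaluate a smoothed field estimate at points spaced along start->end."""
        t = np.linspace(0.0, 1.0, nsamples)[:, None]
        points = start + t * (end - start)
        out = np.empty(nsamples)
        for i, pt in enumerate(points):
            q = np.linalg.norm(pos - pt, axis=1) / hsml
            # For field = 1 this is the SPH density estimate; other fields
            # would want a per-particle density normalization.
            out[i] = np.sum(mass * field * w3d(q) / hsml**3)
        return out

    np.random.seed(5)
    n = 5000
    print(sample_along_ray(np.zeros(3), np.ones(3), 8,
                           np.random.random((n, 3)), np.full(n, 0.05),
                           np.full(n, 1.0 / n), np.ones(n)))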

The latter case, for aggregation, could build on work such as that
done by SPLASH for quickly evaluating integrals along lines of sight.
This would also result in a better conservation of quantities, as the
integral would have sample points defined by the extent of the
smoothing kernel, rather than wherever the particle happened to be
compared to the mesh points.
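
And a sketch of the aggregate mode, in the spirit of the SPLASH
approach: each particle contributes its kernel integrated along the
line of sight, evaluated at its impact parameter.  (SPLASH, as I
understand it, tabulates this integral; here it is just done
numerically, and again there is no index to cut down the particle
list.)

    import numpy as np

    def w3d(q):
        """Cubic spline kernel, 3D normalization for smoothing length h = 1."""
        shape = np.where(q < 0.5, 1 - 6 * q**2 + 6 * q**3,
                         np.where(q < 1.0, 2 * (1 - q)**3, 0.0))
        return (8.0 / np.pi) * shape

    def column_kernel(qb, nz=64):
        """Kernel integrated along the line of sight at impact parameter qb = b/h."""
        zmax = np.sqrt(max(1.0 - qb * qb, 0.0))
        z = np.linspace(-zmax, zmax, nz)
        w = w3d(np.sqrt(qb * qb + z * z))
        return np.sum(0.5 * (w[1:] + w[:-1]) * np.diff(z))   # trapezoid rule

    def column_density(ray_start, ray_dir, pos, hsml, mass):
        """Sum of m_j * F(b_j / h_j) / h_j^2 over particles the ray intersects."""
        d = ray_dir / np.linalg.norm(ray_dir)
        rel = pos - ray_start
        b = np.linalg.norm(rel - rel.dot(d)[:, None] * d, axis=1)  # impact parameters
        total = 0.0
        for bj, h, m in zip(b, hsml, mass):
            if bj < h:
                total += m * column_kernel(bj / h) / h**2
        return total

    np.random.seed(6)
    n = 5000
    print(column_density(np.array([0.0, 0.5, 0.5]), np.array([1.0, 0.0, 0.0]),
                         np.random.random((n, 3)), np.full(n, 0.05),
                         np.full(n, 1.0 / n)))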

The fallback operation in all of these is to construct an
`arbitrary_grid` object and smooth or deposit there.  And, we will
still make available the ability to generate a "full" octree for when
you really need one; this will just no longer be the default, and to
get at it you might have to work a little.

I am of the belief, after having been convinced by compelling evidence
around performance, memory use and data fidelity, that we should do
away with a single global mesh and instead move to having fast point
selection and indexing, coupled with pushing the data representation
close to the real data instead of faking it to look like a mesh.
This is probably the most invasive change, especially as it undoes a
lot of things I worked really hard on implementing and optimizing, but
I think if we continue to do things as they are, we're just going to
be stuck.  We could also evaluate having a global mesh as an option,
but for deposition/smoothing, it's not going to be guaranteed to be
optimal or correct.

I'm not going to propose any syntactic sugar or anything of the like
in this email, although that certainly can/should come up in the YTEP.
(For instance, .grid() on a data object, perhaps?  But we already
have ds.r[] to get arbitrary grids, too!)  In any case, I wanted to send this
to the list so that I can get a sense of any *very* broad brush
feedback before moving forward with a YTEP, and before we start
adjusting the code (in advance of a PR) to move in this direction.
I'm certainly not proposing we do this for 3.3 (or rather, that we
hold 3.3 for this).  If anyone has any comments, opinions (strong or
fleeting), suggestions, or if you maybe know how to fix all this stuff
in a way that Meagan and I haven't thought of, we'd be happy to hear
them.

Thanks for reading this far,

Matt

