[yt-dev] RFC: SPH plan in yt 3.0

Tue Jan 8 07:18:07 PST 2013

Hi all,

I am writing today to request comments and suggestions for supporting
SPH data in yt.  This is part of a broader effort -- which includes
the native Octree support -- in yt 3.0 to correctly support
non-gridpatch data.

This email contains a lot of information about where we are, and I was
hoping to get some feedback from people who have more experience with
SPH data, Gadget, AREPO, etc.  In particular, (if they're around and
have the time) it'd be great to hear from Marcel Haas, Michael J.
Roberts, and Chris Moody.

Additionally, I am very interested in approaching this as an
iterative, incremental process.  Minimum Viable Product (MVP) first,
then improving as we learn and go along.

= Background =

As seen in this pull request:

https://bitbucket.org/yt_analysis/yt-3.0/pull-request/11/initial-n-body-data-support

I've implemented a first pass at N-body data support.  This proceeds
through the following steps:

  * Read header
  * Create octree
  * For each file in output:
    * Add particles to octree
    * When a cell crests a certain number of particles (or when
particles from more than one file are co-located) refine
  * Directly query particles to get data.

As it stands, for N-body data this is very poor at memory
conservation.  In fact, it's extremely poor.  I have however written
it so that in the future, we can move to a distributed memory octree,
which should alleviate difficulties.

For quantitative analysis of quantities that are *in the output*, this
already works.  It does not yet provide any kind of kernel or density
estimator.  By mandating refinement based on file that the particles
are in, load on demand operations can (more) efficiently read data.

Many of the related items are outlined in the corresponding YTEP:

https://yt.readthedocs.org/projects/ytep/en/latest/YTEPs/YTEP-0005.html

Currently, it works for OWLS data.  However, since there has been
something of a Gadget fragmentation, there is considerable difficulty
in "supporting Gadget" when that includes supporting the binary
format.  I've instead taken the approach of attempting to isolate
nearly all the functionality into classes so that, at worst, someone
could implement a couple functions, supply them to the Gadget
interface, and it will use those to read data from disk.  Some
integration work on this is still to come, but on some level it will
be a 'build your own' system for some particle codes.  OWLS data
(which was run with Gadget) on the other hand is in HDF5, is
self-describing, has a considerable number of metadata items
affiliated with it, and even outputs the Density estimate for the
hydro particles.

Where I'm now stuck is how to proceed with handling SPH data.  By
extension, this also applies somewhat to handling tesselation codes
like AREPO and PHURBAS.

= Challenges =

Analyzing quantitative data, where fields can be generated from
fully-local information, is not difficult.  We can in fact largely
consider this to be done.  This includes anything that does not
require applying a smoothing kernel.

The hard part comes in in two areas:

  * Processing SPH particles as fluid quantities
  * Visualizing results

Here are a few of our tools:

  * Data selection and reading: we can get rectangular prisms and read
them from disk relatively quickly.  Same for spheres.  We can also get
boolean regions of these and read them from disk.
  * Neighbor search in octrees.  Given Oct A at Level N, we can get
all M (<=26) Octs that neighbor A at Levels <= N.
  * Voronoi tesselations: we have a wrapper for Voro++ which is
relatively performant, but not amazing.  The DTFE method provides a
way to use this to our advantage, although this requires non-local
information.  (The naive approach, of getting density by dividing the
particle mass by the volume of the voronoi cell, does not give good
agreement at least for the OWLS data.)

Here's what we want to support:

  * Quantitative analysis (profiles, etc)
  * Visualization (Projections, slices, probably not VR yet)
  * Speed

== Gridding Data ==

In the past, one approach we have explored has been to simply grid all
of the values.  Assuming you have a well-behaved kernel and a large
number of particles, you can calculate at every cell-center the value
of the necessary fields.  If this were done, we could reduce the
complexity of the problem *after* gridding the data, but it could
potentially increase the complexity of the problem *before* gridding
the data.

For instance, presupposing we took our Octree and created blocks of
8^3 (like FLASH), we could then analyze the resultant eulerian grid
identically to other codes -- projections, slices, and so on would be
done exactly as elsewhere.

This does bring with it the challenge of conducting the smoothing
kernel.  Not only are there a handful of free parameters in the
smoothing kernel, but it also requires handling edge effects.
Individual octs are confined to a single domain; neighbors, however,
are not.  So we'd potentially be in a situation where to calculate the
values inside a single Oct, we would have to read from many different
files.  From the standpoint of implementing in yt, though, we could
very simply implement this using the new IO functions in 3.0.  We
already have a _read_particles function, so the _read_fluid function
would identify the necessary region, read the particles, and then
proceed to smooth them.  We would need to ensure that we were not
subject to edge effects (something Daniel Price goes into some detail
about in the SPLASH paper) which may prove tricky.

It's been impressed upon me numerous times, however, that this is a
sub-optimal solution.  So perhaps we should avoid it.

== Not Gridding Data ==

I spent some time thinking about this over the last little while, and
it's no longer obvious to me that we need to grid the data at all even
to live within the yt infrastructure.  What gridded data buys us is a
clear "extent" of fluid influence (i.e., a particle's influence may
extend beyond its center) and an easy way of transforming a fluid into
an image.  The former is hard.  But I think the latter is easier than
I realized before.

Recent, still percolating changes in how yt handles pixelization and
ray traversal should allow a simple mechanism for swapping out
pixelization functions based on characteristics of the dataset.  A
pixelization function takes a slice or projection data object (which
is, in fact, an *actual* data object and not an image) and then
converts them to an image.  For SPH datasets, we could mandate that
the pixelization object be swapped out for something that can apply an
SPH-appropriate method of pixelization (see Daniel Price's SPLASH
paper, or the GigaPan paper, for a few examples).  We would likely
also want to change out the Projection object, as well, and we may
need to eliminate the ability to avoid IO during the pixelization
procedure (i.e., the project-once-pixelize-many advantage of grid
data).  But I think it would be do-able.

So we could do this.  I think this is the harder road, but I think it
is also possible.  We would *still* need to come up with a mechanism
for applying the smoothing kernel.

= Conclusions and Contributions =

So I think we have two main paths forward -- gridding, and not
gridding.  I would very, very much appreciate comments or feedback on
either of these approaches.  I believe that, regardless of what we
choose to do, we'll need to figure out how to best apply a smoothing
kernel to data; I think this is probably going to be difficult.  But I
am optimistic we are nearing the point of usability.  For pure N-body
data, we are already there -- the trick is now to look at the data as
fluids, not just particles.

The best way to contribute help right now is to brainstorm and give
feedback.  I'd love to hear ideas, criticisms, flaws in my logic, ways
to improve things, and so on and so forth.  I think once we have an
MVP, we will be better set up to move forward.

If you'd like to add a reader for your own flavor of particle data, we
should do that as well -- especially if you are keen to start
exploring and experimenting.  The code so far can be found in the
repository linked above.  I'd also appreciate comments (or acceptance)
of the pull request.  I am also happy to hold a Hangout to discuss
this, go over things, help set up readers or whatever.  I consider
this a relatively high-priority project.  It will be a key feature of
yt 3.0.

(Speaking of which, today I've allotted time to focus on getting the
last major patch-refinement feature done -- covering grids.  Hopefully
patch-refinement people can test it out after that's done!)

Thanks for your patience with such a long email!  :)

-Matt