[yt-dev] Pre-YTEP proposal: eliminate a global mesh for particles

Fri May 27 17:31:55 PDT 2016

Hi Nathan,

Thanks for your comments.  I wanted to briefly reply; it's the start of a
holiday weekend in the US and most everywhere else has adjourned on a
Friday, so I don't anticipate this thread moving much in the next little
while.  (Nor should it!)  So I wanted to quickly dive in just a bit.

On Fri, May 27, 2016 at 6:34 PM, Nathan Goldbaum <nathan12343 at gmail.com>
wrote:

>
>
> On Friday, May 27, 2016, Matthew Turk <matthewturk at gmail.com> wrote:
>
>> Hi all,
>>
>> As some of you know, Meagan Lang and I have been working on a new
>> system for indexing particles quickly.  I'm writing an email today as
>> a pre-YTEP phase of getting feedback before moving forward on a plan
>> of action.
>>
>> As of yt 3.3 (dev), the way particles work is as follows:
>>
>>  * During ingest, build a global octree mesh.  (This is problem #0.)
>> This requires loading a uint64 into memory for every single particle,
>> sorting it, and then building an octree.  The octree construction is
>> pretty fast; the sorting is one of the slower parts, and a uint64 for
>> every particle is a pretty bad limiter.  (1024^3 => 8 gigs for the
>> array, then sorting.)
>>  * A `regions` object is stored on the index, which is of dimensions
>> (min(N_data_files,256))^3.  This stores a bitmask of which files fall
>> into each cell in that conceptual unigrid object.
>>  * The `regions` object gets queried whenever an object is selected,
>> and then it gets used to identify which data files should be loaded.
>>  * Whenever smoothing happens, 100% of the particles from the files
>> (for the data object in question) are loaded into memory at once.
>> These are queried, etc, through a "gather" method.  Each mesh point
>> that is being evaluated finds its neighbors, then smoothes onto
>> itself.  If you make a projection of the full data set, with a
>> standard `ProjectionPlot`, all of these things happen.
>>  * Parallelism does not help SPH datasets for any smoothing operation
>> or any `spatial` chunk operation.  Parallelism does not help the
>> indexing operation.
>>
>> This is problematic in a number of ways; the biggest, most obvious way
>> is that every time the dataset is loaded, a mesh octree is made, this
>> takes up a ton of RAM, and it only gets used for smoothing deposition
>> operations.  And, it's slow.
>>
>> And one thing I want to make very clear at this point: using it in
>> this way is not ideal.  This mesh octree is supposed to provide an
>> estimate of the density in a given location; then we smooth onto that.
>> This preserves the "project once, reuse many" idea that we've had in
>> yt in the past.  (For that particular purpose, it's fine.)  But, since
>> we use an MX Octree, which by its nature is not balanced by particle
>> count, we end up with empty zones.  Furthermore, during smoothing
>> operations, we can run into cases where we under-find particles.
>> (This can result in bad results for smoothing; one solution I'm
>> exploring for this is using a balanced tree like a kD-tree, which is
>> the obvious solution that I'm embarrassed I haven't explored before.)
>> We aren't being as true to the data as we should be when we do this.
>> Plus, if we do this for projections, we're making a *ton* more work
>> than we need to.  We build an octree, smooth onto the octree, then
>> immediately collapse it to a quad tree.  (Yes, really.)
>>
>> Meagan has been pushing forward work to implement a fast,
>> memory-conservative indexing system.  It works, and it works really
>> well, for indexing specific regions of the fileset that make up a
>> dataset.  The idea is to develop a hierarchical indexing system that
>> does not presuppose there is any sorting of particles within the
>> files, but which can benefit from such a sorting.  It creates two
>> levels of compressed bitmap indices, one for each data file (or
>> sub-chunk of a data file), where the second level is sparse and only
>> exists in regions where more than one data file has particles.
>>
>> What we've discovered when attempting to apply this to smoothing, is
>> that the biggest problem comes from the notion that there is a single,
>> fiducial "mesh" that exists everywhere.  What I mean by that is that
>> if such a mesh existed, we could chunk over each file in whatever
>> order we want, applying particles and smoothing them and whatnot.
>> Generating such a mesh in a memory efficient way is very difficult,
>> and we have yet to identify a mechanism by which we can do that
>> without resulting in somewhere between N and N^2 reads of files for
>> every smoothing operation.  Either we have a single, global mesh
>> (which, unless we're wrong, requires either multiple passes over files
>> until convergence *or* a full in-memory set of particles) or we do
>> lots of over-reads to make sure that the mesh isn't dependent on which
>> file we are reading and when.  An issue with this which comes up
>> immediately is that if we don't have a fiducial mesh, then depositing
>> a particle or smoothing a particle will give different results (which
>> cannot be combined in a well-defined way) depending on the order it is
>> done.
>>
>> What Meagan's built gives us a fast index, as well as an estimate of
>> where regions of high particle count can be found; it further can give
>> us a really good idea of how to build a quadtree that over-resolves
>> structure.  (Comes in handy below.)
>>
>> I propose we eliminate the notion of a fixed global mesh for particles.
>>
>> There are a few places in the code that *right now* assume a global mesh:
>>
>>  * Projection and slice plots of "smoothed" quantities
>>  * Smoothing operations that are not on user-defined (`arbitrary_grid`)
>> meshes
>>  * Volume renderings (which don't really work anyway)
>>  * Ray tracing individual rays (particularly important for light ray
>> and trident)
>>
>> I'll address each of these individually, but I want to start with this
>> statement: we should not make assumptions about global meshes, but we
>> should make such assumptions and decisions *available* to the people
>> using the code.  This means, for instance, that we should continue to
>> implement and make available smoothing operations, but we expose them
>> through an `arbitrary_grid` object.
>>
>> Projections and slices of non-local particle quantities (i.e.,
>> density, temperature, etc, of gas particles) are presently handled by
>> smoothing onto a mesh.  We should instead be doing some method of
>> pixelizing these directly, either onto a quadtree that can be reused,
>> or onto a buffer directly.  Ting-Wai To, an undergrad at Illinois, has
>> built an implementation in Cython of a flat pixelizer that's
>> reasonably fast.  What I would propose is using the quad tree mesh
>> found during the indexing step, and then smooth over the particles
>> onto this mesh directly.  This means we can use *just* an `io` chunk,
>> rather than (implicitly) a `spatial` chunk for SPH particles when
>> smoothing them.  Integrating the overall SPH smoothing into the
>> visualization layer, rather than the data access layer, also means
>> that we can rotate particles more easily.
>>
>> Smoothing operations that are not on user-defined meshes, such as
>> computing the total density inside a sphere, are currently using the
>> same method as above.  Two main uses of this happen -- computing some
>> quantity and then doing a radial profile of it, and computing a
>> density or something and comparing it to another quantity.  However,
>> it is becoming increasingly clear that this is not what most people
>> using SPH codes *want*.  It is not a particularly bad approximation as
>> long as the resolution is high enough, but it's also not particularly
>> good, and it is lossy.  Instead, we should be regarding the particles
>> as particles, and computing profiles on *those*.  This could cause
>> issues in regimes where a particle crosses bin boundaries when the
>> mesh point doesn't, but I would assert that this is an ill-posed
>> operation to begin with.
>>
>> Volume rendering at present is problematic.  An octree is really the
>> only way we *could* do things, but it's also not terribly good for it.
>> We'd be smoothing, volume rendering, then visualizing.  The only
>> reason we do it this way is because we only have a VolumeSource, and
>> no SmoothedParticleSource.  We should instead implement a
>> SmoothedParticleSource, which would itself be considerably faster and
>> easier to implement, I would wager, than forcing the issue in the
>> opposite direction.  There are remnants of this in the code from star
>> particle rendering which could be used in this case.
>>
>> Finally, and the one I'm the most worried about, is ray tracing of
>> individual rays.  We need to ensure that we are able to support the
>> needs of communities that build on yt functionality, such as trident.
>> I believe, but may be incorrect, there are essentially two modes we
>> need to support here.  The first is specific query points, and the
>> other is aggregate sums.  Query points are where you evaluate the
>> value of some quantity at multiple points along a ray running from A
>> .. B, and aggregates are where you evaluate the sum total of some
>> value along the ray.
>>
>> In the former case, for query points, we actually can rely on existing
>> functionality rather nicely.  We would be able to specify points (or
>> even implicitly choose them to be at some regular interval or
>> something), evaluate smoothing kernel at those points, and return
>> those values.  This is identical to how it operates now, except that
>> the points would need to be specified in some manner rather than
>> simply being the locations that the ray traverses an oct cell.  I
>> believe this would improve the fidelity of the calculation.
>>
>> The latter case, for aggregation, could build on work such as that
>> done by SPLASH for quickly evaluating integrals along lines of sight.
>> This would also result in a better conservation of quantities, as the
>> integral would have sample points defined by the extent of the
>> smoothing kernel, rather than wherever the particle happened to be
>> compared to the mesh points.
>>
>> The fallback operation in all of these is to construct an
>> `arbitrary_grid` object and smooth or deposit there.  And, we will
>> still make available the ability to generate a "full" octree for when
>> you really need one; this will just no longer be the default, and to
>> get at it you might have to work a little.
>>
>> I am of the belief, after having been convinced by compelling evidence
>> around performance, memory use and data fidelity, that we should do
>> away with a single global mesh and instead move to having fast point
>> selection and indexing, coupled with pushing the data representation
>> close to the real data instead of faking it to look like particles.
>> This is probably the most invasive change, especially as it undoes a
>> lot of things I worked really hard on implementing and optimizing, but
>> I think if we continue to do things as they are, we're just going to
>> be stuck.  We could also evaluate having a global mesh as an option,
>> but for deposition/smoothing, it's not going to be guaranteed to be
>> optimal or correct.
>>
>> I'm not going to propose any syntactic sugar or anything of the like
>> in this email, although that certainly can/should come up in the YTEP.
>> (For instance, .grid() on a data object, perhaps?  But we also already
>> have ds.r[] to get arbitrary grids, too!)  But, I wanted to send this
>> to the list so that I can get a sense of any *very* broad brush
>> feedback before moving forward with a YTEP, and before we start
>> adjusting the code (in advance of a PR) to move in this direction.
>> I'm certainly not proposing we do this for 3.3 (or rather, that we
>> hold 3.3 for this.)  If anyone has any comments, opinions (strong or
>> fleeting), suggestions, or if you maybe know how to fix all this stuff
>> in a way that Meagan and I haven't thought of, we'd be happy to hear
>> them.
>
>
> Just a few initial comments after reading this over:
>
> This is a big change, and will require a good amount of work to rework yt
> internals after merging Meagan's work in. If we decide to do it, it will
> require enough work that it may be worth thinking about doing it as part of
> yt 4.0, which we can branch off after 3.3 is released. This would also free
> us to think about other large changes, like reworking the field system to
> be more friendly to code generation, symbolic definition of fields,
> and acceleration of field computation via pipelining, as well as splitting
> off analysis modules into subprojects outside yt.
>

That sounds reasonable.  I'm not sure yet if a 4.0 would be necessary (if
the user-facing API remains the same, for instance), but in theory I'm not
opposed.

This would change the *internals*, but I believe that it would be able to
be implemented without breaking API -- and people using yt for SPH would
simply see an improvement in speed and quality.

>
> We need to be careful about changing things without breaking functionality
> or providing clear migration paths if something needs to be removed. This
> means expanding test coverage, and identifying functionality that will
> break without some refactoring to use new per-particle focused data
> processing algorithms. I think an ideal way to handle this is to adapt
> existing interfaces to use new particle-focused data analysis rather than
> mesh-focused analysis. Just as an example, using your radial profile
> example, we could make it so the ProfileND class silently delegates to a
> particle-based or SPH-based profiling algorithm for a particle field.
> Ideally, we want to present the same high-level interfaces to users of
> grid, particle, and unstructured mesh codes as much as possible.

I agree completely with this.

>
> No matter what, we need to think carefully about this and try to avoid
> biting off more than we can chew as a community or as individual
> contributors. At the same time, we need to be careful to keep breaking
> changes to a minimum  - I don't want to see a repeat of the mixed reactions
> to the early yt 3.x releases due to broken or missing functionality that
> didn't get ported from the 2.x codebase.
>
>

Agreed.  And as I've said before, I bear a large portion of that
responsibility.  I think that if we move ahead with this we can do so
having learned from the yt 3 transition.

One additional note -- I mentioned, but didn't go into detail, that this
would need to be coordinated with external packages that use the internal
yt API.  I mean that, because I want to ensure that if we do this, they
have as little (or even no) transition cost to their own APIs and internal
workings, and that they would see either no change or an improvement to
results.  This means trident, powderday, the xray package John developed
whose name escapes me right now (sorry John!) and so on.

-Matt

>
>> Thanks for reading this far,
>>
>> Matt
>> _______________________________________________
>> yt-dev mailing list
>> yt-dev at lists.spacepope.org
>> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
>>
>
> _______________________________________________
> yt-dev mailing list
> yt-dev at lists.spacepope.org
> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.spacepope.org/pipermail/yt-dev-spacepope.org/attachments/20160527/8e818453/attachment-0001.htm>