[yt-dev] RFC: (yt-3.0) Geometry handling redesign

Fri Feb 24 10:46:29 PST 2012

Great stuff Matt.

+1 on the presented GeometryHandler design. It doesn't sound like it
changes performance much and it's a nicer organization.

I like the chunking work, but I'm confused about syntax and what goes
where. Does each data object define its own chunks method? In the chunks
method, would it be clearer to return something like a "chunk" data object
instead of self? This probably doesn't matter though, since a user would
not care about type(ds).

I haven't worked with particle types in yt (it's all DM to me!), but I
would prefer the `obj.particles["position_x"]` syntax. With the current
design, I don't like that I have to inspect fields to see if they are
particle fields or on the grid. This is fine now because I am familiar with
the fields in my data, but I remember this being confusing as a new user.

For particle type selection, I think syntax like
`obj.particles["PopIICluster"]["position_x"]`
or even `obj.particles.type("PopIICluster")["position_x"]` is clearer. With
a tuple it almost looks like you are selecting two fields. If we go with
the tuple though, can we make the particle type the first element? I have
no idea about mandating a particle type, but we can hash that out in the
hangout.

Best,
Casey

On Fri, Feb 24, 2012 at 8:12 AM, Matthew Turk <matthewturk at gmail.com> wrote:

> Hi all,
>
> This email is going to be long, but the changes I am proposing in it
> will require feedback.  Next Monday at 1PM Eastern I'll be holding a
> hangout on Google+, if you would like to join in and discuss it with
> lower latency.  (Of course, if nobody shows up, that's okay too.)  We
> don't really have a detailed review process for design decisions like
> this, and I hope that this discussion will guide developing that kind
> of review process in the future.
>
> Everything described here is in the branch geometry_handling at
> https://bitbucket.org/MatthewTurk/yt-refactor
>
> = Background =
>
> Geometry in yt has always been designed around two things:
>
> 1) Grid patches
> 2) Spatial location based on bounding boxes
>
> This has presented a number of excellent opportunities for optimizing
> patch-based AMR data.  However, it has posed challenges for supporting
> particle-based datasets.  (Particles inside patch-based data are
> somewhat easier, although if you want to see one of the worst pieces
> of yt, look no further than FakeGridForParticles, which mocks up
> particles as grids with varying x,y,z, so that existing
> fluid-selection routines can be used.)  Furthermore, the state of
> Octree codes is not good.  For relatively small (up to about 64)
> domains in RAMSES, we can do analysis with relative ease -- volume
> rendering, projections, fluid analysis, etc, all work very nicely.
> However, what is required is a costly step of regridding octs at the
> outset, to form grid patches with some efficiency.  This regridding is
> not stored between runs, and it's not *efficient*.
>
> This has worked so far, but it is showing its strain.  As it stands,
> yt cannot directly support SPH codes without somehow making SPH look
> like grids.  Octree codes can't be supported without making them look
> like grid codes.  Both of these are very slow operations.  And on top
> of that, the current state of how data is read in, even for grid
> codes, is a bit gross.  There are a number of conditionals, mechanisms
> for striding over data (i.e., "_get_grids()" which is called in
> parallel, and iterates over grids, and sometimes you have to call
> _get_point_indices and sometimes you don't), and probably worst of all
> a lot of unnecessary memory duplication because arrays in
> regions/spheres/etc are constructed by iterating over grids, filling
> in a list of arrays, and then concatenating that list into a second
> array.
>
> Furthermore, several times recently the question has come up of
> handling and supporting more generic geometries: spherical, polar,
> cylindrical.  Unfortunately, because we're so heavily based around the
> idea of grids as is, these are difficult to support.  All of our
> collision detection has been designed with rectilinear surfaces in
> mind.  That being said, I'm really quite happy with how yt *functions*
> for patch-based grids.  I'm just not sure that as it stands it is
> maintainable, and I think that addressing the geometry issues could
> ultimately also help out a great deal with scalability.  (More on that
> below.)
>
> = Geometry Overhaul =
>
> For the last couple weeks, I have been working on an overhaul of the
> geometry system in yt, attempting to handle parallel striding in a
> format-agnostic way, deal with generating fields in as simple a way as
> possible, and also attempting to speed things up.  I'm going to
> summarize below the changes that I've made, the new design, and the
> current status.
>
> == Design: Selectors ==
>
> As it stands, yt data objects (such as Region, Sphere, etc) are mostly
> thought of as conduits for data.  They apply a criteria to the domain
> and select all of the fluid points or particles within a region or
> that fit a specific non-spatial criteria.  I have extended this
> concept to construct very lightweight objects called selectors,
> similar to how ParticleIO is now conducted for Enzo.  These are
> objects in Cython that accept either a set of bounding-box arrays or a
> set of spatial points and decide if those points are included in a
> given region.  This is essentially a combination of the
> get_point_indices and get_cut_mask methods that currently exist.
> These have all been designed to be GIL-free, so that in the future
> they can be threaded using OpenMP (although it turns out not very much
> time is spent on them.)  To date I have implemented spheres, regions,
> disks, cutting planes, slices, ortho rays, and grid collections.  They
> all give bitwise identical results to the old method and are almost
> all substantially faster.  (Cutting planes are particularly better,
> regions are slightly slower, and the others are all about 25-40%
> faster.)
>
> One primary advantage is that these can (and do) trivially provide
> counting methods, masking methods, and point-evaluation methods.  So
> we can use the same routines to select particles and individual points
> (for collision detection) that we use for selecting grid cells.  This
> will allow reuse of data object code with particle outputs, and is a
> primary advantage of the refactor.
>
> This also led to separating selection-based data containers from
> constructed data containers.
>
> == Design: Ditching the Hierarchy ==
>
> (Hopefully that heading caught some eyes.)  In line with attempting to
> reduce Enzo-isms in the code, I have been moving away from
> specifically calling something a "hierarchy," and instead moving
> toward referring to geometry handlers.  grids, grid_left_edges,
> grid_right_edges, and so on will no longer be guaranteed to appear
> hanging off these objects.  The new class structure:
>
> GeometryHandler
> GridGeometryHandler(GeometryHandler)
> OctreeGeometryHandler(GeometryHandler)
>
> and eventually, other methods.  The concept behind this is to isolate
> those items that *must* require domain knowledge, mesh knowledge, or
> particle/fluid distribution knowledge, and place those in the
> GeometryHandler.  All items specific to patch based or block based AMR
> will go in GridGeometryHandler and so on.  This will allow for
> fundamentally different method of IO for octrees from grids, for
> instance.  The geometry handlers are all required to implement the
> following methods:
>
> _read_selection(self, fields, data_object, chunk)
> _chunk(self, data_object, chunking_style, ngz = 0)
>
> I will describe chunks later.  Note that the first routine,
> _read_selection, accepts the names of the fields, and the data object
> that describes how to select which portions of a dataset participate
> in a given selection.  For the specific case of Enzo (as I am still in
> design phase, I have only been working with Enzo data, although I will
> move on to FLASH and Boxlib next) read_selection calls a routine that
> counts and masks grids, batches operations to files, and then returns
> only the necessary data for a selection.  The data object does no
> discarding of out-of-bounds or child-masked data.
>
> == Design: Chunking ==
>
> Currently, yt uses de facto chunking of data in the form of grids.
> When Derived Quantities want to handle individual grids, one at a
> time, they "preload" the data from whatever grids the
> ParallelAnalysisInterface thinks they deserve.  These grids are
> iterated over, and handled individually, then the result is combined
> at the end.  Profiles do something similar.  However, both of these
> are de facto, and not really designed.  They rely on calls to
> semi-private functions on data objects, manually masking data, on and
> on.
>
> To get around this ugly spot, I have started working on redesigning IO
> and data handling to function in "chunks."  In this method, data
> objects will request that the geometry handler supply a set of chunks
> (the function _chunk, described above, conducts this task.)  Chunks
> are of the form (IO_unit, size), where IO_unit is only ever managed or
> handled by _read_selection, above.
>
> To start the Chunks shuffling over the output, the code calls
> data_source.chunks(fields, chunking_style).  Right now only "spatial",
> "io" and "all" are supported for chunking styles.  This corresponds to
> spatially-oriented division, IO-conserving, and all-at-once (not
> usually relevant.)  What happens next is kind of fun.  The chunks
> function looks like this:
>
>    def chunks(self, fields, chunking_style, **kwargs):
>        for chunk in self.hierarchy._chunk(self, chunking_style, **kwargs):
>            with self._chunked_read(chunk):
>                self.get_data(fields)
>                yield self
>
> Note what it does here -- it actually yields *itself*.  However,
> inside the chunked_read function, what happens is that the attributes
> corresponding to the size, the current data source, and so on, are set
> by the geometry handler (still called a hierarchy here.)  So, for
> instance, execution might look like this:
>
> for ds in my_obj.chunks(["Density"], "spatial"):
>  print ds is my_obj
>  print ds["Density"].size
>
> The first line will actually print True, but the results from the
> second one will be the size of (for instance) the grid it's currently
> iterating over.  In this way, it becomes much easier to stride over
> subsets of data.  Derived quantities now look like this:
>
>        chunks = self._data_source.chunks([], chunking_style="io")
>        for ds in parallel_objects(chunks, -1):
>            rv = self.func(ds, *args, **kwargs)
>
> It chunks things, evaluates and then stores intermediate results.
>
> This is not meant to replace spatial decomposition in parallel jobs,
> but it *is* designed to enable much easier and *mesh-neutral* division
> of labor for parallelism and for IO.  If we were to call chunk on an
> octree, it no longer has to make things look like grids; it just makes
> them look like flattened arrays (unless you chunk over spatial, which
> I haven't gotten into yet.)
>
> Essentially, by making the method of subsetting and striding over
> subsetted data more compartmentalized, the code becomes more clear and
> more maintainable.
>
> = Status =
>
> As it stands, I am able to select data very nicely -- this means
> slices, profiles, derived quantities all work.  However (see below)
> particles still need some work and handling, and constructed data
> objects (smoothed covering grids, projections) have yet to be ported.
>
> = Open Questions =
>
> However, the main reason I am writing -- when the code is not yet
> fully functional! -- is to ask for advice and consideration on a few
> points.
>
> == Particle Types ==
>
> In the past, we've talked about handling particle types differently
> depending on the particle.  Because this is destined for yt-3.0, I
> would like to propose that we break compatibility with old scripts in
> favor of clarity in our particle types.  Specifically, rather than
> doing:
>
> obj["particle_position_x"][obj["particle_type"] == 1]
>
> We do,
>
> obj[("particle_positions_x", 1)]
>
> (or obj[("particle_positions_x", "PopIICluster")] if you are using a
> code that supports such expression.)
>
> And if a field is found to be a particle type field, but without a
> tuple, we simply raise a KeyError.  We can emulate old behavior with:
>
> obj[("particle_position_x",None)]
>
> But I am inclined to say that I would rather *mandate* particle type
> selection, rather than merely allow it.  This will lead to much
> greater clarity, I think.
>
> [+-][01] on
>
> == Particle Access ==
>
> Currently, we access particles off a data_object:
>
> sphere["particle_position_x"]
>
> Do we want to keep this, or move to being much more explicit, that we
> require sphere.particles["particle_position_x"] ?  I can see both
> sides to this.  Explicit is good, and would allow us to be both more
> clear with users and about fields (and with our IO methods) but
> allowing old behavior encourages not having to necessarily worry about
> if something is a fluid or a particle, which is a good thing I think.
>
> == User Survey ==
>
> Should we conduct a brief (only a handful of questions) polling users
> for their feelings on the changes or potential changes to
> nomenclature, particles, and so on?  We could use it also as a time to
> identify other outstanding issues with yt.
>
> = Conclusion =
>
> Does this all make a sort of sense?
>
> [+-][01] on the design presented so far
>
> If you're interested in helping out, please get in touch.  I'm still
> laying groundwork in some areas, but in other areas I would appreciate
> help and advice moving forward.
>
> Thanks to everyone who made it this far,
>
> Matt
> _______________________________________________
> yt-dev mailing list
> yt-dev at lists.spacepope.org
> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.spacepope.org/pipermail/yt-dev-spacepope.org/attachments/20120224/aafdb70d/attachment.htm>