[yt-dev] summary of GDF discussion

Matthew Turk matthewturk at gmail.com
Tue Jan 8 04:27:53 PST 2013


Hi Jeff,

Thanks for the summary -- this sounds really great.  I'm sorry I
couldn't make it.

On Mon, Jan 7, 2013 at 10:00 AM, j s oishi <jsoishi at gmail.com> wrote:
> Hi all,
>
> On Friday, 4 Jan 2013, we had a hangout to discuss the next steps to be
> taken for the implementation of a C-language library for GDF. We call this
> library implementation gdfio. For reference, the GDF standard can be found
> at
>
> https://bitbucket.org/yt_analysis/yt/src/554d144d9d248c6f70d8c665a5963aa39b2d6bb3/yt/utilities/grid_data_format/docs/gdf_specification.txt?at=yt
>
> One of the biggest issues we discussed was whether or not to rely on
> libraries other than HDF5 when writing gdfio. The main argument for doing so
> is that none of us are experienced C programmers; thus implementing things
> like hashes, linked lists, and so forth might be a barrier to making
> progress. The main argument against doing so is that gdfio should be a very
> low-level library that can be deployed on many different systems, and
> dependencies make this difficult.

My feeling is that the HDF5 requirement should eventually be stripped
out as well.  Ideally, we would be able to swap backends in and out --
whether these be remote servers over MPI, SQL databases, raw binary
readers, etc etc.

This is far beyond the initial pass at the library or API, however.
Still, I would suggest that the API be kept neutral with respect to
the backend.
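
For instance, the public entry points could all be written against an
opaque handle, with the backend chosen at open time.  A minimal sketch
of what I mean -- every name below is hypothetical, not a proposal:

  /* Sketch only: hypothetical gdfio handle and backend selection. */
  typedef struct gdfio_file gdfio_file;   /* opaque; backend hidden */

  typedef enum {
      GDFIO_BACKEND_HDF5,   /* the only backend for a first pass    */
      GDFIO_BACKEND_RAW     /* placeholder for, e.g., raw binary    */
  } gdfio_backend;

  gdfio_file *gdfio_open(const char *path, gdfio_backend backend);
  int         gdfio_close(gdfio_file *f);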

>
> In order to best assess what to do, we attempted to identify *what*
> non-stdlib C features we would actually need in our implementation. Because
> GDF itself does not make any links between grids (only recording their
> parents in an optional metadata step), we came to the conclusion that the
> only thing we need is a hash table. Even this hash table is optional and
> only for reading. For example, you could do something like
> gdfio_read_grid(grid_id, "density") and get back an object that includes
> both the density data and its associated metadata. We thus decided to
> proceed without *any* dependencies aside from HDF5.
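
That signature looks right to me.  Just to be explicit about what I
picture coming back from it -- a sketch only, with a hypothetical
struct layout and hypothetical names:

  /* Hypothetical return type for gdfio_read_grid(). */
  typedef struct {
      long    grid_id;
      int     level;        /* refinement level                     */
      long    parent_id;    /* parent grid, if that metadata exists */
      int     dims[3];      /* grid dimensions                      */
      double *data;         /* the requested field, e.g. "density"  */
  } gdfio_grid_field;

  /* Read one field from /data/grid_%010i/ plus its grid metadata. */
  gdfio_grid_field *gdfio_read_grid(gdfio_file *f, long grid_id,
                                    const char *field);
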
>
> Kacper pointed out that the most important issue for efficiency is how to
> convert native memory structures into GDF's /data/grid_%010i/ structures
> without expensive copies. Native data could be 5D (block, quantity, x, y,
> z), 4D, or 3D depending on the code. On this point, Sam noted that the
> easiest thing might be to require a "buffer" type interface that gives
> (pointer, size) and allows gdfio to grab the requisite number of floats or
> doubles. We decided to try this buffer approach first. Essentially, this is
> a question of how much gdfio provides to users who will be wrapping it to
> write their code's data. For now, we decided to keep gdfio's offerings
> minimal. This allows us to see how it will work and what might be the best
> additional features to add later.

It sounds a bit like you're describing the general problem of defining
ordering, strides, and views.  I think this could be a dangerous path
to go down: one can either cover *everything* (similar to how NumPy
works, in fact) and deal with all the possible use cases -- Fortran
ordering, 5D arrays, particular stride patterns, etc etc -- or one can
cover nothing and mandate that the calling code itself perform any
needed copy in memory.  Since Sam's buffer approach is effectively the
latter, I think it is the best option.
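
Just to be concrete about the scope of the "cover everything" route, it
amounts to carrying around a full array descriptor, something like the
following (hypothetical, only to illustrate what we would be signing up
for):

  #include <stddef.h>

  /* Hypothetical "describe any native layout" view -- the path I'd
   * rather gdfio not go down. */
  typedef struct {
      void      *base;        /* pointer to the first element        */
      int        ndim;        /* 3, 4, or 5 depending on the code    */
      size_t     dims[5];     /* extent along each axis              */
      ptrdiff_t  strides[5];  /* byte strides; covers C and Fortran  */
      int        is_double;   /* element type: double vs. float      */
  } gdfio_array_view;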

An alternative, which breaks what I suggested above about removing the
HDF5 dependency, would be to provide "ghost" methods that accept HDF5
dataspace references.  This would allow the code that calls gdfio to
provide the necessary stride / ordering information.  I don't
particularly like this, though, and my suspicion is that if these
methods were provided they would probably never get used.

I think the solution you proposed, of providing extremely minimal
options, is the most likely to result in timely success.
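
Concretely, the write side could be as small as this -- again
hypothetical, just sketching the "minimal offerings" idea:

  #include <stddef.h>

  /* Hypothetical minimal write call: the caller hands gdfio a flat,
   * contiguous buffer and its length; any reshuffling from the code's
   * native 3D/4D/5D layout happens on the caller's side beforehand. */
  int gdfio_write_grid(gdfio_file *f, long grid_id, const char *field,
                       const double *buffer, size_t n_elements);

That pushes the copy onto the simulation code, but it keeps gdfio
itself trivial, which seems like the right trade for a first version.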

>
> Casey and Sam both brought up issues of how to parallelize. One issue is
> that parallel HDF5 can be quite tricky to deal with, so we decided to forgo
> using parallel HDF5 for now. We agreed that the simplest path forward is to
> use file links, so that each non-root-IO processor writes its data to a
> separate data-only file that is linked back to the main HDF5 file. This
> means we need to add an API for creating, writing, and reading from
> data-only files.

Why does the API need to be different for data-only files?
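
If the linking is done with HDF5 external links, then (for reading at
least) the main file should already look seamless: opening a
/data/grid_%010i/ group would resolve transparently into the sub-file.
A rough, untested sketch of the write side with plain HDF5 calls, just
to check that I understand the plan:

  #include <hdf5.h>

  /* Sketch: a non-root-IO rank writes its grids to a data-only file,
   * and the root-IO rank links that group into the main GDF file. */
  void link_subfile_example(void)
  {
      /* Non-root rank: create the data-only file and write
       * /data/grid_0000000042 into it (dataset writes elided). */
      hid_t sub = H5Fcreate("data.0001.h5", H5F_ACC_TRUNC,
                            H5P_DEFAULT, H5P_DEFAULT);
      /* ... H5Gcreate2 / H5Dcreate2 / H5Dwrite calls go here ... */
      H5Fclose(sub);

      /* Root-IO rank: add an external link in the main file. */
      hid_t main_f = H5Fopen("output.gdf", H5F_ACC_RDWR, H5P_DEFAULT);
      H5Lcreate_external("data.0001.h5", "/data/grid_0000000042",
                         main_f, "/data/grid_0000000042",
                         H5P_DEFAULT, H5P_DEFAULT);
      H5Fclose(main_f);
  }

If that is the mechanism, the existing open/read calls should work
unchanged on the main file, and only the writer needs to know that
sub-files exist.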

>
> Finally, we came up with the next step: I (Jeff) will draft an API for gdfio
> this week. I will submit it to yt-dev for discussion and iteration. Once we
> have an API that looks good, we'll begin coding it up.

Awesome!  Nice work.  This has some amazing possibilities.

Also, for anyone who hasn't seen, the yt blog has had some items of
interest related to grid data format stuff recently:

http://blog.yt-project.org/post/ParticleGenerator.html
http://blog.yt-project.org/post/Simple_Grid_Refinement.html

-Matt

>
> If I misrepresented anything from the meeting or GDF, please let me know.
> Thanks to all who participated!
>
> j
>
> _______________________________________________
> yt-dev mailing list
> yt-dev at lists.spacepope.org
> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
>


