[Yt-dev] Simulation Database

Tue Sep 6 08:34:35 PDT 2011

If we're going minimal here, then I don't think we need the topgrid
entries.  In theory, everything one needs to know about the simulation can
come from the simulation_uuid.

Britton

On Tue, Sep 6, 2011 at 8:21 AM, Matthew Turk <matthewturk at gmail.com> wrote:

> (Summary: skip to the bottom, help decide what records should be in a
> minimal simulation database.  Please take the time to contribute to
> this discussion; it likely affects you, even if you don’t think it
> does!)
>
> Currently, in order to support pickling data objects in
> external-to-the-.yt-file formats, yt keeps a record of the most recent
> N (where N is usually 500-1000) parameter files it has personally
> seen.  This is stored in a csv file, typically in
> ~/.yt/parameter_files.csv .  The reasons for using CSV are pretty
> reactionary: there were for a while difficulties with getting sqlite
> everywhere, shelve was unreliable and difficult to sort unless you
> used a database backend, and I wanted a single file.
>
> We *already* have a simulation output database, but it’s in CSV.  If
> you check, you’ll find you have one.
>
> Very briefly, let me motivate why we have the .csv file (also covered
> in the method paper) and how it is modified.  The idea is that, before
> we had UUIDs in parameter files, it was challenging to identify a
> parameter file uniquely.  For instance, one occasionally will need to
> know “Did I mean *this* DD0070, or *that* DD0070?”  So using the
> CurrentTimeIdentifier (and falling back -- for other codes -- on the
> ST_CTIME, which is not immensely reliable) and a couple different
> pieces of information, a hash was constructed for that parameter file.
>  This has included the path and a few bits of info.
>
> So every time you load a parameter file in yt, it looks in this .csv
> and, if necessary, updates the *path* for a given *hash*.
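
For anyone who has not looked at that file, the hash-to-path bookkeeping described above can be sketched roughly as follows. This is only an illustration under assumptions, not yt's actual code: the function names, the two-column layout, and the exact inputs to the hash are all hypothetical, and it is written for Python 3.

```python
import csv
import hashlib
import os

def make_hash(basename, current_time_identifier):
    # Hypothetical recipe: which pieces of information yt really hashes
    # is not spelled out here, so these two inputs are placeholders.
    key = "%s;%s" % (basename, current_time_identifier)
    return hashlib.md5(key.encode("utf-8")).hexdigest()

def update_path(csv_path, pf_hash, new_path):
    """Point pf_hash at new_path; return True if the file was rewritten."""
    rows = {}
    if os.path.exists(csv_path):
        with open(csv_path, newline="") as f:
            for stored_hash, stored_path in csv.reader(f):
                rows[stored_hash] = stored_path
    if rows.get(pf_hash) == new_path:
        return False  # already up to date, nothing to write
    rows[pf_hash] = new_path
    # A CSV has no in-place update, so the whole file is written again.
    with open(csv_path, "w", newline="") as f:
        csv.writer(f).writerows(sorted(rows.items()))
    return True
```

Even in this toy form, a single changed path means rewriting every record, and nothing here protects against two processes writing at once.
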
>
> Since the time of implementation, UUIDs have been inserted into Enzo;
> I have been in contact with a couple other code developers and there
> is a remote, outside possibility that such a rider could be added to
> some of them.  I am not holding my breath, so we will continue falling
> back on this hash, which is for all intents and purposes likely to be
> Universally Unique, although it may not follow the code and so may not
> serve as a consistent Identifier.
>
> However, as shows up once in a while on the mailing lists, the .csv
> file can be problematic.  My approach was conservative, which leads to
> lots of small bits of IO on it.  (Not cool.)  Updating requires
> rewriting the entire file.  (Not cool.)  If you have multiple
> processes going, sometimes it can become corrupted.  (Not cool.)  Plus,
> it’s just not a terribly easily-queryable format.
>
> Recently, inspired by something Tom talked about at the Enzo Spring
> Workshop, I put into Enzo the ability to write out a set of
> information about a simulation output at the same time the data itself
> was written.  The format for this set was in SQLite3, which is simple,
> easy to install, and -- most importantly for my near-obsessive goals
> -- has been installed with yt for a couple months now.  This last
> weekend, I was able to insert the first pass at a transition to using
> this format of database in yt, in my fork:
>
>
> https://bitbucket.org/MatthewTurk/yt/changeset/1b685b6bfbf2#chg-yt/utilities/parameter_file_storage.py
>
> In this commit I also add an ORM called “peewee” that abstracts all of
> the SQL out; in this way we actually can use a SQLite database while
> not having any raw text string SQL in yt.
> The main changes boil down to:
>
>  * No more CSV; all sqlite, which handles multi-process locking and
> concurrency.
>  * Reads from the same database that Enzo writes out, and that other
> codes could choose to write out if they wanted.
>  * Uses UUIDs instead of hashes.  (Could potentially break existing
> pickles, but I plan to provide a migration strategy.)
>
> There are some big advantages to having an output database -- one
> could select all outputs based on the simulation they derived from (if
> such a field is available, as it is for Enzo), one could load() just
> the UUID or hash, and we can provide a Reason “file open” GUI that
> doesn’t require touching the file system explicitly.  (The idea there
> is that Reason would pop up a grid of all available parameter files,
> with their times, redshifts, etc, and then you would choose.)  Plus,
> *other* utilities like Stranger, Jacques, etc could use it.  And long
> term, Tom’s vision of a universal simulation database becomes a *lot*
> more tractable, if we have a firm starting point.
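
To make those two use cases concrete (all outputs of one simulation, and resolving a bare UUID to a path), here is a minimal sketch using Python's standard sqlite3 module directly rather than the ORM from the changeset. The table name `outputs`, the helper names, and the sample rows are assumptions for illustration; only the column names come from the field list in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outputs ("
    " dset_uuid TEXT PRIMARY KEY, simulation_uuid TEXT,"
    " pf_path TEXT, time REAL, redshift REAL)"
)
# Made-up sample records: two outputs from simA, one from simB.
conn.executemany(
    "INSERT INTO outputs VALUES (?, ?, ?, ?, ?)",
    [
        ("u2", "simA", "/runs/a/DD0070/DD0070", 70.0, 0.5),
        ("u1", "simA", "/runs/a/DD0010/DD0010", 10.0, 3.0),
        ("u3", "simB", "/runs/b/DD0070/DD0070", 70.0, 0.5),
    ],
)

def outputs_for_simulation(conn, simulation_uuid):
    """All outputs derived from one simulation, ordered by time."""
    cur = conn.execute(
        "SELECT dset_uuid, pf_path, time, redshift FROM outputs"
        " WHERE simulation_uuid = ? ORDER BY time",
        (simulation_uuid,),
    )
    return cur.fetchall()

def path_for_uuid(conn, dset_uuid):
    """Resolve a dataset UUID to its last known path, or None."""
    row = conn.execute(
        "SELECT pf_path FROM outputs WHERE dset_uuid = ?", (dset_uuid,)
    ).fetchone()
    return row[0] if row else None
```

A load()-by-UUID call or a "file open" grid would presumably sit on top of lookups like these, which is also how the two DD0070 outputs above stay distinguishable.
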
>
> However, before this can be accepted into mainline, there are three
> things that need to be decided.
>
> Here are the currently included fields:
>
> dset_uuid
> output_type
> pf_path
> creation_time
> last_seen_time
> simulation_uuid
> redshift
> time
> topgrid0
> topgrid1
> topgrid2
>
> 1) Thumbs up or thumbs down to moving to SQLite from CSV?
> 2) Should any additional fields be *added*?
> 3) Should any of these fields be *removed*?  (I have opinions on this,
> but I would like to hear from others first.)
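
As a reference point for that discussion, the eleven fields listed above could map onto a single SQLite table roughly like this. It is a sketch under assumptions: the table name, the column types, and the choice of dset_uuid as primary key are guesses, and it uses the standard sqlite3 module rather than the ORM in the changeset.

```python
import sqlite3
import time

COLUMNS = (
    "dset_uuid", "output_type", "pf_path", "creation_time",
    "last_seen_time", "simulation_uuid", "redshift", "time",
    "topgrid0", "topgrid1", "topgrid2",
)

# Column types and the primary key are assumptions, not from the thread.
SCHEMA = """
CREATE TABLE IF NOT EXISTS outputs (
    dset_uuid       TEXT PRIMARY KEY,
    output_type     TEXT,
    pf_path         TEXT,
    creation_time   INTEGER,
    last_seen_time  INTEGER,
    simulation_uuid TEXT,
    redshift        REAL,
    time            REAL,
    topgrid0        INTEGER,
    topgrid1        INTEGER,
    topgrid2        INTEGER
)
"""

def record_output(conn, **fields):
    """Insert a record, or replace the existing one with the same dset_uuid."""
    unknown = set(fields) - set(COLUMNS)
    if unknown:
        raise ValueError("unknown fields: %s" % sorted(unknown))
    fields.setdefault("last_seen_time", int(time.time()))
    cols = ", ".join(fields)
    marks = ", ".join("?" for _ in fields)
    # INSERT OR REPLACE swaps the whole row, so columns not passed in
    # end up NULL; callers should pass the full record each time.
    conn.execute(
        "INSERT OR REPLACE INTO outputs (%s) VALUES (%s)" % (cols, marks),
        tuple(fields.values()),
    )
    conn.commit()
```

Dropping the topgrid columns, or adding new ones, would be a change to the column tuple and the schema string in this sketch.
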
>
> I think this is the wrong place to put everything there is to know
> about a simulation.  The parameter file exists for that, and every
> field is an additional overhead of space, complexity, etc.  But it
> would be nice if things that people *commonly* want to query on or
> *sort* on were in here.  What else?
>
> -Matt
> _______________________________________________
> Yt-dev mailing list
> Yt-dev at lists.spacepope.org
> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
>