If we're going minimal here, then I don't think we need the topgrid entries.  In theory, everything one needs to know about The Simulation can come from the simulation_uuid.<br><br>Britton<br><br><div class="gmail_quote">

On Tue, Sep 6, 2011 at 8:21 AM, Matthew Turk <span dir="ltr">&lt;<a href="mailto:matthewturk@gmail.com">matthewturk@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

(Summary: skip to the bottom, help decide what records should be in a<br>

minimal simulation database.  Please take the time to contribute to<br>

this discussion; it likely affects you, even if you don’t think it<br>

does!)<br>

<br>

Currently, in order to support pickling data objects in<br>

external-to-the-.yt-file formats, yt keeps a record of the most recent<br>

N (where N is usually 500-1000) parameter files it has personally<br>

seen.  This is stored in a csv file, typically in<br>

~/.yt/parameter_files.csv .  The reasons for using CSV are pretty<br>

reactionary: there were for a while difficulties with getting sqlite<br>

everywhere, shelve was unreliable and difficult to sort unless you<br>

used a database backend, and I wanted a single file.<br>

<br>

We *already* have a simulation output database, but it’s in CSV.  If<br>

you check, you’ll find you have one.<br>

<br>

Very briefly, let me motivate why we have the .csv file (also covered<br>

in the method paper) and how it is modified.  The idea is that, before<br>

we had UUIDs in parameter files, it was challenging to identify a<br>

parameter file uniquely.  For instance, one occasionally will need to<br>

know “Did I mean *this* DD0070, or *that* DD0070?”  So using the<br>

CurrentTimeIdentifier (and falling back -- for other codes -- on the<br>

ST_CTIME, which is not immensely reliable) and a couple different<br>

pieces of information, a hash was constructed for that parameter file.<br>

 This has included the path and a few bits of info.<br>

<br>

So every time you load a parameter file in yt, it looks in this .csv<br>

and if necessary updates the *path* for a given *hash*.<br>

<br>
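As a rough sketch of the lookup described above (not the actual yt code):<br>
the function names, the choice of SHA-256, and the exact inputs to the hash<br>
(basename plus a time identifier) are all assumptions made for illustration.<br>

```python
import hashlib


def make_output_hash(basename, time_identifier):
    """Build a stable identifier for an output (hypothetical recipe).

    Assumes the basename (e.g. "DD0070") and a time identifier such as
    CurrentTimeIdentifier are enough to tell two outputs apart.
    """
    key = "{};{}".format(basename, time_identifier)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def record_output(records, basename, time_identifier, path):
    """Store or refresh the path for this output's hash.

    `records` is a plain dict standing in for the CSV contents; if the
    output has moved, the old path is simply overwritten.
    """
    output_hash = make_output_hash(basename, time_identifier)
    records[output_hash] = path
    return output_hash
```

Under these assumptions, two outputs that are both named DD0070 but have<br>
different time identifiers get different hashes, while moving one output to<br>
a new directory only changes the stored path.<br>
<br>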

Since the time of implementation, UUIDs have been inserted into Enzo;<br>

I have been in contact with a couple other code developers and there<br>

is a remote, outside possibility that such a rider could be added to<br>

some of them.  I am not holding my breath, so we will continue falling<br>

back on this hash, which is for all intents and purposes likely to be<br>

Universally Unique, although it may not follow the code and so may not<br>

serve as a consistent Identifier.<br>

<br>

However, as shows up once in a while on the mailing lists, the .csv<br>

file can be problematic.  My approach was conservative, which leads to<br>

lots of small bits of IO on it.  (Not cool.)  Updating requires<br>

rewriting the entire file.  (Not cool.)  If you have multiple<br>

processes going, the file can sometimes be corrupted.  (Not cool.)  Plus,<br>

it’s just not a terribly easily-queryable format.<br>

<br>

Recently, inspired by something Tom talked about at the Enzo Spring<br>

Workshop, I put into Enzo the ability to write out a set of<br>

information about a simulation output at the same time the data itself<br>

was written.  The format for this set was in SQLite3, which is simple,<br>

easy to install, and -- most importantly for my near-obsessive goals<br>

-- has been installed with yt for a couple months now.  This last<br>

weekend, I was able to insert the first pass at a transition to using<br>

this format of database in yt, in my fork:<br>

<br>

<a href="https://bitbucket.org/MatthewTurk/yt/changeset/1b685b6bfbf2#chg-yt/utilities/parameter_file_storage.py" target="_blank">https://bitbucket.org/MatthewTurk/yt/changeset/1b685b6bfbf2#chg-yt/utilities/parameter_file_storage.py</a><br>


<br>

In this commit I also add an ORM called “peewee” that abstracts all of<br>

the SQL out; in this way we actually can use a SQLite database while<br>

not having any raw text-string SQL in yt.<br>

The main changes boil down to:<br>

<br>

 * No more CSV; all sqlite, which handles multi-process locking and concurrency.<br>

 * Reads from the same database that Enzo writes out, and that other<br>

codes could choose to write out if they wanted.<br>

 * Uses UUIDs instead of hashes.  (Could potentially break existing<br>

pickles, but I plan to provide a migration strategy.)<br>

<br>

There are some big advantages to having an output database -- one<br>

could select all outputs based on the simulation they derived from (if<br>

such a field is available, as it is for Enzo), one could load() just<br>

the UUID or hash, and we can provide a Reason “file open” GUI that<br>

doesn’t require touching the file system explicitly.  (The idea there<br>

is that Reason would pop up a grid of all available parameter files,<br>

with their times, redshifts, etc, and then you would choose.)  Plus,<br>

*other* utilities like Stranger, Jacques, etc could use it.  And long<br>

term, Tom’s vision of a universal simulation database becomes a *lot*<br>

more tractable, if we have a firm starting point.<br>

<br>

However, before this can be accepted into mainline, there are three<br>

things that need to be decided.<br>

<br>

Here are the currently included fields:<br>

<br>

dset_uuid<br>

output_type<br>

pf_path<br>

creation_time<br>

last_seen_time<br>

simulation_uuid<br>

redshift<br>

time<br>

topgrid0<br>

topgrid1<br>

topgrid2<br>

<br>
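To make the field list concrete, here is a minimal sketch of such a table<br>
using only the standard-library sqlite3 module.  The table name "outputs",<br>
the column types, and the helper functions are assumptions for illustration;<br>
this is not the schema Enzo writes nor the peewee model in the changeset.<br>

```python
import sqlite3

# Column names come from the field list above; the types are guesses.
SCHEMA = """
CREATE TABLE IF NOT EXISTS outputs (
    dset_uuid       TEXT PRIMARY KEY,
    output_type     TEXT,
    pf_path         TEXT,
    creation_time   INTEGER,
    last_seen_time  INTEGER,
    simulation_uuid TEXT,
    redshift        REAL,
    time            REAL,
    topgrid0        INTEGER,
    topgrid1        INTEGER,
    topgrid2        INTEGER
)
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_output(conn, **row):
    """Insert a record, replacing any existing row with the same dset_uuid.

    Keyword names must match column names; this sketch does not validate them.
    """
    cols = ", ".join(row)
    marks = ", ".join("?" for _ in row)
    sql = "INSERT OR REPLACE INTO outputs ({}) VALUES ({})".format(cols, marks)
    conn.execute(sql, tuple(row.values()))
    conn.commit()


def outputs_for_simulation(conn, simulation_uuid):
    """Return (dset_uuid, pf_path) pairs for one simulation, ordered by time."""
    cur = conn.execute(
        "SELECT dset_uuid, pf_path FROM outputs "
        "WHERE simulation_uuid = ? ORDER BY time",
        (simulation_uuid,),
    )
    return cur.fetchall()
```

With a table like this, selecting every output of one simulation is a single<br>
query on simulation_uuid, and re-registering an output that has moved just<br>
replaces its row.<br>
<br>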

1) Thumbs up or thumbs down to moving to SQLite from CSV?<br>

2) Should any additional fields be *added*?<br>

3) Should any of these fields be *removed*?  (I have opinions on this,<br>

but I would like to hear from others first.)<br>

<br>

I think this is the wrong place to put everything there is to know<br>

about a simulation.  The parameter file exists for that, and every<br>

field is an additional overhead of space, complexity, etc.  But it<br>

would be nice if things that people *commonly* want to query on or<br>

*sort* on were in here.  What else?<br>

<br>

-Matt<br>


</blockquote></div><br>