[Yt-dev] "Database" of Parameter Files

Mon Apr 4 14:00:23 PDT 2011

Hi Matt & others,

I've done some more thinking about this, and I have some questions
and ideas I'd like your thoughts on.

I would like to set down some solid ideas of what we would like to get
out of this. This will allow us to design the system better from the
get-go. Here's what I've got:

- The ability to identify simulation datasets that are ordered in
time. For an Enzo dataset this is easy with the UUID values. For other
datatypes, I'm not certain it would be quite so direct. There could
be some kind of similarity measure based on dataset parameters, or
we could keep it simpler and do some inference based on file paths.

- What are the important parameters of a simulation to record? Because
the SQL database is stored in a binary format, it wouldn't be too
difficult or unreasonable to store everything in an Enzo restart text
file, for example, and similarly for other datatypes. There's no
reason why the database can't be heterogeneous, with one table per
datatype, each with field labels of its own.

- What are the questions we'd like to be able to ask? I can think of:
* Which simulations are in this simulation's lineage?
* Similarly, what sets of time-ordered datasets do I have?
* What datasets are similar to the current simulation, judged by some
set of parameters I specify? The similarity could be along the usual
set of things: redshift, box size, resolution.
* What are the differences between two simulations/datasets in my collection?

- Another thing would be to add a step where the dataset is searched
for on disk, to see if it's still available. I could see this as an
optional, default==False step, due to the preponderance of sluggish
Lustre disk systems out there.
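
To make the points above a bit more concrete, here is a rough sketch in
Python using the standard-library sqlite3 module. It is only an
illustration under stated assumptions, not existing yt code: the table
name (enzo), the column names (sim_uuid, time, redshift, box_size,
top_grid_dim), the function names, and the check_disk flag are all
hypothetical. It shows one table for one datatype, a time-ordered query
keyed on a shared UUID, a parameter diff between two datasets, and the
optional on-disk check that is off by default.

```python
import os
import sqlite3

# Hypothetical schema: one table per datatype, each with its own columns.
# Only an Enzo-like table is shown; other datatypes would get their own.
SCHEMA = """
CREATE TABLE IF NOT EXISTS enzo (
    path TEXT PRIMARY KEY,   -- location of the parameter file
    sim_uuid TEXT,           -- assumed to be shared by one simulation's outputs
    time REAL,               -- simulation time of this output
    redshift REAL,
    box_size REAL,
    top_grid_dim INTEGER
);
"""


def open_db(filename=":memory:"):
    """Open (or create) the database and make sure the tables exist."""
    conn = sqlite3.connect(filename)
    conn.row_factory = sqlite3.Row  # allows access to columns by name
    conn.executescript(SCHEMA)
    return conn


def add_enzo(conn, path, sim_uuid, time, redshift, box_size, top_grid_dim):
    """Record one dataset, replacing any earlier entry for the same path."""
    conn.execute(
        "INSERT OR REPLACE INTO enzo VALUES (?, ?, ?, ?, ?, ?)",
        (path, sim_uuid, time, redshift, box_size, top_grid_dim),
    )


def time_series(conn, sim_uuid, check_disk=False):
    """Return the paths of one simulation's datasets, ordered in time.

    With check_disk=True, datasets no longer present on disk are dropped.
    This is off by default because stat calls can be slow on some
    filesystems.
    """
    rows = conn.execute(
        "SELECT path FROM enzo WHERE sim_uuid = ? ORDER BY time", (sim_uuid,)
    )
    paths = [row["path"] for row in rows]
    if check_disk:
        paths = [p for p in paths if os.path.exists(p)]
    return paths


def diff(conn, path_a, path_b):
    """Return {column: (value_a, value_b)} for every parameter that differs."""
    query = "SELECT * FROM enzo WHERE path = ?"
    a = conn.execute(query, (path_a,)).fetchone()
    b = conn.execute(query, (path_b,)).fetchone()
    if a is None or b is None:
        raise KeyError("both datasets must already be in the database")
    return {k: (a[k], b[k]) for k in a.keys() if k != "path" and a[k] != b[k]}
```

With something like this, "what time-ordered datasets do I have?" becomes
a call to time_series() for each distinct sim_uuid, and "what differs
between these two datasets?" is a call to diff(). A similarity query
could be a SELECT with a tolerance on columns such as redshift or
box_size.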

Thanks for the comments!

-- 
Stephen Skory
s at skory.us
http://stephenskory.com/
510.621.3687 (google voice)