[Yt-dev] "Database" of Parameter Files

Tue Apr 5 04:57:38 PDT 2011

Hi Stephen,

On Mon, Apr 4, 2011 at 5:00 PM, Stephen Skory <s at skory.us> wrote:
> Hi Matt & others,
>
> I've done some more thinking about this stuff and I have some
> questions/thoughts I'd like your thoughts on.
>
> I would like to set down some solid ideas of what we would like to get
> out of this. This will allow us to design the system better from the
> get-go. Here's what I've got:
>
> - The ability to identify simulation datasets that are ordered in
> time. For an Enzo dataset this is easy with the UUID values. For other
> datatypes, I'm not so certain it would be quite so direct. There could
> be some kind of similarity measure based on dataset parameters, or
> perhaps keep it simpler and do some inference based on file paths.

Yes, I agree.  The goal should be to identify outputs from a given
"simulation run" ordered in time.  For Enzo this will be more
straightforward than perhaps for other codes, because it includes the
UUIDs of the current and previous outputs.  (My recollection is that
this is why UUIDs were added in the first place.)  I would say that
inferring based on file paths is fine for other simulation types, as
well as for Enzo datasets that were created before the UUIDs were added.
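
A minimal sketch of both orderings, to make the idea concrete.  The
function names and the (current_uuid, previous_uuid, path) record layout
are assumptions for illustration, not anything that exists in yt or Enzo:

```python
import re


def order_by_uuid(records):
    """Build time-ordered chains of output paths from UUID links.

    records: iterable of (current_uuid, previous_uuid_or_None, path).
    This record layout is hypothetical.
    """
    by_uuid = {cur: (prev, path) for cur, prev, path in records}
    # UUIDs that some other output names as its predecessor
    has_successor = {prev for prev, _ in by_uuid.values() if prev is not None}
    chains = []
    for cur in by_uuid:
        if cur in has_successor:
            continue  # not the latest output of any chain
        chain, node, seen = [], cur, set()
        # Walk backwards until the links run out (or a cycle is detected)
        while node in by_uuid and node not in seen:
            seen.add(node)
            prev, path = by_uuid[node]
            chain.append(path)
            node = prev
        chain.reverse()  # earliest output first
        chains.append(chain)
    return chains


def order_by_path(paths):
    """Fallback: sort by the trailing number in each path (e.g. DD0010)."""
    def key(path):
        match = re.search(r"(\d+)$", path.rstrip("/"))
        return (int(match.group(1)) if match else -1, path)
    return sorted(paths, key=key)
```

If a run was restarted and branched, each branch comes back as its own
chain that repeats the shared early outputs.  The path fallback only
makes sense within a single directory of similarly named outputs.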

>
> - What are the important parameters of a simulation to record? Because
> SQL is binary, it wouldn't be too difficult or unreasonable to store
> everything in an Enzo restart text file, for example. Or similarly for
> other datatypes. There's no reason why the database can't be
> heterogeneous with one table per datatype, each with data field labels
> of their own.

This is where I become less certain of things.  My initial feeling was
that we'd want a time indicator (redshift, time or both), a full path
to the dataset, and its position in a graph of simulation outputs.  It
makes sense to include additional parameters, but as you note it may
end up being that the only mechanism we have at the moment for doing
so is a set of heterogeneous table formats.
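
For concreteness, here is one possible layout using Python's sqlite3
module.  The table and column names are made up for this sketch.  The
generic name/value parameters table is just one alternative to
per-datatype tables, not a decision:

```python
import sqlite3


def create_db(conn):
    """Create the two hypothetical tables if they do not exist yet."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS outputs (
            uuid          TEXT PRIMARY KEY,
            previous_uuid TEXT,          -- position in the graph of outputs
            path          TEXT NOT NULL, -- full path to the dataset
            code_time     REAL,          -- time indicator
            redshift      REAL,          -- time indicator (cosmology runs)
            datatype      TEXT
        )""")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS parameters (
            uuid  TEXT NOT NULL REFERENCES outputs(uuid),
            name  TEXT NOT NULL,
            value TEXT,
            PRIMARY KEY (uuid, name)
        )""")


def add_output(conn, uuid, path, previous_uuid=None, code_time=None,
               redshift=None, datatype=None, parameters=None):
    """Insert or overwrite one output and any extra parameters."""
    conn.execute(
        "INSERT OR REPLACE INTO outputs "
        "(uuid, previous_uuid, path, code_time, redshift, datatype) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (uuid, previous_uuid, path, code_time, redshift, datatype))
    for name, value in (parameters or {}).items():
        # Values are stored as text so any code's parameters fit one table
        conn.execute(
            "INSERT OR REPLACE INTO parameters (uuid, name, value) "
            "VALUES (?, ?, ?)", (uuid, name, str(value)))
    conn.commit()
```

Storing parameter values as text keeps the schema identical across
simulation codes, at the cost of having to convert types when querying.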

My initial hope for this database was twofold --

1) Provide a list of all "known" datasets in Reason (which we could
get from parameter_files.csv)
2) Provide a database that could be shared of simulations with
parameters.  (i.e., manual and simple publishing of datasets on shared
filesystems.)

Both of these kind of tie together.  On a shared filesystem, like
Kraken, one could imagine opening a dataset with:

load("db://sskory/SOME_SIMULATION_UUID/RedshiftOutput0030")
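
The db:// scheme above is only an idea; nothing like it exists in yt
today.  As a rough sketch of how such a string might be split into its
pieces (parse_db_url is a hypothetical helper, and the owner/simulation/
output layout is assumed from the example above):

```python
from urllib.parse import urlsplit


def parse_db_url(url):
    """Split 'db://owner/simulation_id/output_name' into its three parts."""
    parts = urlsplit(url)
    if parts.scheme != "db":
        raise ValueError("not a db:// URL: %r" % (url,))
    segments = [s for s in parts.path.split("/") if s]
    if not parts.netloc or len(segments) != 2:
        raise ValueError("expected db://owner/simulation/output: %r" % (url,))
    owner = parts.netloc
    simulation_id, output_name = segments
    return owner, simulation_id, output_name
```

The three pieces would then be looked up in the shared database to find
the actual path on disk.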

Anyway, I don't think we need to implement *full* parameter scraping,
but doing so would not add so much overhead as to be undesirable.

>
> - What are the questions we'd like to be able to query? I can think of:
> * Which simulations are in this simulation's lineage?
> * Similarly, what sets of time-ordered datasets do I have?
> * What datasets are similar to this current simulation, or similar
> according to some set of parameters I am specifying? The similarity could be
> along the usual set of things: redshift, box size, resolution.
> * What are the differences between two simulations/datasets in my collection?

This is exactly right, and I completely agree.  I don't want to
completely reinvent the VO (it's done a good job of inventing itself),
but instead to apply a simple layer of querying and comparison; it can
be pretty DIY, I think.
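
To give a sense of how small such a layer could be, here is a sketch of
the "differences" and "similar" queries over plain parameter
dictionaries.  The function names, the dict-of-parameters representation
and the 5% default tolerance are all assumptions for illustration:

```python
import math


def diff_parameters(a, b):
    """Return {name: (value_in_a, value_in_b)} for parameters that differ.

    A parameter missing from one side is reported with None on that side.
    """
    diffs = {}
    for name in sorted(set(a) | set(b)):
        value_a, value_b = a.get(name), b.get(name)
        if value_a != value_b:
            diffs[name] = (value_a, value_b)
    return diffs


def is_similar(a, b, names, rel_tol=0.05):
    """True if every named numeric parameter agrees within rel_tol."""
    return all(
        name in a and name in b
        and math.isclose(a[name], b[name], rel_tol=rel_tol)
        for name in names
    )
```

For example, two datasets could be compared on ["redshift", "box_size"]
only, ignoring resolution, just by changing the list of names.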

>
> - Another thing would be to add a step where the dataset is searched
> for on disk, to see if it's still available. I could see this as an
> optional, default==False step, due to the preponderance of sluggish
> Lustre disk systems out there.

This is where it differs from parameter_files.csv.  That system
acts as a FIFO of the last, say, 200 datasets.  When a dataset is
loaded, its unique hash/ID is looked up, and if it's found in the .csv
the entry is updated to point to the new location.  Rather than
providing an update mechanism explicitly, it's done implicitly.  I
don't see why the database couldn't work the same way.
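
A minimal in-memory sketch of that behavior, assuming a bounded FIFO
keyed on the dataset's hash/ID.  The class and method names are
hypothetical and this is not the actual parameter_files.csv code.
Whether a re-loaded dataset should also move to the "newest" end is left
open; here it keeps its original position:

```python
from collections import OrderedDict


class RecentDatasets:
    """Bounded FIFO mapping dataset ID -> last known path."""

    def __init__(self, max_entries=200):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def record_load(self, dataset_id, path):
        # Assigning to a known ID overwrites its path in place: this is
        # the implicit update, with no separate "update" call needed.
        self._entries[dataset_id] = path
        # Drop the oldest entries once the limit is exceeded
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)

    def lookup(self, dataset_id):
        """Return the last known path, or None if the ID is not stored."""
        return self._entries.get(dataset_id)
```

Calling record_load every time a dataset is opened keeps the stored path
current without ever having to scan the disk.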

-Matt

>
> Thanks for the comments!
>
> --
> Stephen Skory
> s at skory.us
> http://stephenskory.com/
> 510.621.3687 (google voice)