[yt-dev] sqlite parameter files storage

Mon Dec 19 14:14:53 PST 2011

Hey all (Matt, in particular),

I've got some dead time waiting for jobs to run, so I'd like to A)
discuss this topic and B) make some "final" decisions so I can go
ahead and start coding. Sorry about the length of this!

A) Briefly, for those of you who aren't aware of this topic, the idea
is to replace the ~/.yt/parameter_files.csv text file with a SQLite
database file. This has many advantages over a text file, too many to
list here. But in particular, it has built-in locks for writes (*),
which is especially useful for multi-level parallelism. Write
contention is something we're currently working around in the
"official" examples with a kludge [0]. I think everyone is in
agreement that the switch is a good thing, no?
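
To make the locking point concrete, here is a minimal sketch using
Python's standard sqlite3 module. The table name parameter_files, its
three columns, and the helper record_parameter_file are made up for
illustration; none of this is existing yt code. The idea is that each
process opens the database with a timeout, so a writer that finds the
file locked waits for the other writer instead of clobbering it.

```python
import sqlite3


def record_parameter_file(db_path, pf_hash, name, location):
    """Insert or update one entry (hypothetical schema, not yt's)."""
    # timeout: seconds to wait on another writer's lock before raising
    # sqlite3.OperationalError ("database is locked").
    conn = sqlite3.connect(db_path, timeout=30.0)
    try:
        # "with conn" wraps the statements in one transaction: it commits
        # on success and rolls back if an exception is raised.
        with conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS parameter_files ("
                "hash TEXT PRIMARY KEY, name TEXT, location TEXT)"
            )
            conn.execute(
                "INSERT OR REPLACE INTO parameter_files VALUES (?, ?, ?)",
                (pf_hash, name, location),
            )
    finally:
        conn.close()
```

Calling this twice with the same hash leaves a single row holding the
most recent location, which is roughly what the csv tries to do today.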

The other big thing that this feeds into is a remote, centralized
storage point for a clone of this database. I've discussed this idea
before, sketched up a simple partially functional example, and made a
simple video cast of how it works. [1]

B) The final decisions that I'd like input on are these.

- What data fields should we include in the databases? There are three
ways to go with this.
  #1. The same amount of data that is in the current csv (basically:
hash, name, location on disk, time, type). This is probably too few
data fields, so I think we can scratch it off immediately.
  #2. Everything that can be gleaned from the dataset. This is
actually practical to do because the database is binary and
searchable. However, because the fields in various datasets are so
different, this could result in a fairly unwieldy database with (in a
Chris Traeger voice) a literal ton of columns. This could be mitigated
by having a different database table for each type of dataset (Enzo,
Athena, etc.), but that really only swaps one kind of complexity for
another.
  #3. A minimal set of "interesting" fields (redshift, box resolution,
cosmological parameters, etc.). This is more attractive than #2 in
that it's very unlikely anyone will want to search over every field in
a dataset, so it keeps things more streamlined. But then we have to
agree on a reasonable set of parameters to include, and it makes
future changes a bit more difficult.
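
As a rough sketch of what option #3 could look like, the snippet below
builds a single table holding the current csv fields plus a few
"interesting" columns, then runs a numerical search over it. The
column names (redshift, top_grid_dim, omega_matter, hubble_constant)
and the sample rows are placeholders I picked for illustration, not an
agreed-upon set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    """CREATE TABLE parameter_files (
        hash TEXT PRIMARY KEY,
        name TEXT,
        location TEXT,
        time REAL,
        type TEXT,
        redshift REAL,            -- hypothetical "interesting" fields
        top_grid_dim INTEGER,
        omega_matter REAL,
        hubble_constant REAL)"""
)

# Made-up example rows, purely to exercise the query below.
rows = [
    ("a1", "DD0010", "/data/run1/DD0010", 12.5, "enzo", 3.0, 128, 0.27, 0.71),
    ("b2", "DD0040", "/data/run1/DD0040", 150.0, "enzo", 0.5, 128, 0.27, 0.71),
    ("c3", "DD0046", "/data/run1/DD0046", 200.0, "enzo", 0.0, 128, 0.27, 0.71),
]
conn.executemany(
    "INSERT INTO parameter_files VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows
)

# Typed columns let the database do numeric comparisons and sorting.
low_z = [
    r[0]
    for r in conn.execute(
        "SELECT name FROM parameter_files "
        "WHERE type = ? AND redshift < ? ORDER BY redshift",
        ("enzo", 1.0),
    )
]
print(low_z)
```

Adding a column later is possible with ALTER TABLE ... ADD COLUMN, but
existing rows would hold NULL for it until they are re-scanned, which
is the "future changes" cost mentioned above.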

What do we all think?

- Once we have the above settled, and working, I would like to extend
the functionality to the cloud bzzzzzzz. Get it? It's a buzz word. So
it buzzes. Thanks, I'll be here all week.

There are three (four) ways to do this that I can think of:

  #1. Amazon SimpleDB. The advantage of this is that it's offered
free to all up to 1GB of storage and some reasonable limit of
transactions per month. Each user sets up her own account on AWS, and
no one else has to be involved. But the main disadvantage is that it
only supports storing things as strings, which makes numerical
searches and sorts less useful, more annoying, and slower.
  #1.5. Amazon Relational DB. This is not free at any level, but it
offers all the usual DB functionality. Amazon does offer some
educational grants, so we could apply for that. This service is
targeted at usage levels that we will never reach, but if we get free
time, that's fine. I think in this case (and the next two) user
accounts on the database would have to be created for yt users by
"us".
  #2. Google App Engine. Free right now in pre-beta invitation-only
phase. It will be similar to #1.5 above, as I understand things, and
not be free forever. Personally, I seriously doubt that we'd get in on
the pre-beta. I've looked at the application form [2] and I don't even
understand one of the questions.
  #3. Host a MySQL (or similar) database on one of our own servers
(yt-project or similar). The advantage is that the cost should be no
more than Matt is paying now. The disadvantage is, again, we have to
set up accounts. Also, I don't know if Dreamhost (is that where
yt-project is still?) allows open MySQL databases. Another advantage
is that unlike #1.5 or #2 above, costs should never rise suddenly when
an educational grant or beta period ends.
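
To illustrate the strings-only drawback mentioned under #1, here is a
small plain-Python sketch (no Amazon API involved) of why numbers
stored as strings sort badly, along with the usual workaround of
zero-padding to a fixed width. The encode helper, its width and
precision, and the sample redshifts are assumptions for illustration,
and the padding trick as written only handles non-negative values.

```python
redshifts = [0.5, 10.0, 2.0]

# Stored naively as strings, values compare character by character,
# so "10.0" sorts before "2.0".
as_strings = [str(z) for z in redshifts]
naive_order = sorted(as_strings)


def encode(z, width=10, places=4):
    """Zero-pad a non-negative number so string order matches numeric order."""
    return f"{z:0{width}.{places}f}"


# With fixed-width padding, lexicographic order agrees with numeric order.
padded_order = sorted(encode(z) for z in redshifts)

print(naive_order)
print(padded_order)
```

Every search then has to encode its bounds the same way, which is
the "more annoying" part.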

Thanks for reading; any and all comments are welcome.

[0] http://yt-project.org/doc/advanced/parallel_computation.html#parallelizing-your-analysis
[1] http://vimeo.com/28797703
[2] https://docs.google.com/spreadsheet/viewform?formkey=dHBwRmpHV2VicFVVNi1PaFBvUGgydXc6MA

(*) There are issues with locks on parallel network file systems, but
most home partitions on supercomputers are NFS (not something like
Lustre) so this shouldn't be a problem.

-- 
Stephen Skory
s at skory.us
http://stephenskory.com/
510.621.3687 (google voice)