[yt-dev] sqlite parameter files storage

Mon Dec 19 14:42:22 PST 2011

Hi Stephen,

On Mon, Dec 19, 2011 at 5:14 PM, Stephen Skory <s at skory.us> wrote:
> Hey all (Matt, in particular),
>
> I've got some dead time waiting for jobs to run so I'd like to A)
> discuss this topic and B) make some "final" decisions about this so I
> can go ahead and do some coding on this. Sorry about the length of
> this!
>
> A) Briefly, for those of you who aren't aware of this topic, the idea
> is to replace the ~/.yt/parameter_files.csv text file with a SQLite
> database file. This has many advantages over a text file, too many to
> list here. But in particular, it has built-in locks for writes (*),
> which is especially useful for multi-level parallelism. This is
> something we're currently addressing in "official" examples with a
> kludge [0]. I think everyone is in agreement that this is a good
> thing, no?

Yes.

>
> The other big thing that this feeds into is a remote, centralized
> storage point for a clone of this database. I've discussed this idea
> before, sketched up a simple partially functional example, and made a
> simple video cast of how it works. [1]
>
> B) The final decisions that I'd like input on are these.
>
> - What data fields should we include in the databases? There are three
> ways to go with this.
>  #1. The same amount of data that is in the current csv (basically:
> hash, name, location on disk, time, type). This is probably too few
> data fields, so I think we can scratch it off immediately.
>  #2. Everything that can be gleaned from the dataset. This is
> actually fine to do practically because of the database being binary
> and searchable. However, because the fields in various datasets are so
> different, this could result in a fairly unwieldy database with (in a
> Chris Traeger voice) a literal ton of columns. This could be mitigated
> by having different database tables for each type of dataset (Enzo,
> Athena, etc...), but that really only swaps one kind of complexity for
> another.
>  #3. A minimal set of "interesting" fields (redshift, box resolution,
> cosmological parameters, etc..) This is more attractive than #2 in
> that it's very unlikely anyone will want to search over every field in
> a dataset, so it keeps things more streamlined. But then we have to
> agree to a reasonable set of parameters to include, and it makes
> future changes a bit more difficult.
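
(A rough sketch of what option #3 could look like with Python's
built-in sqlite3 module. The table name, column names, and helper
functions are assumptions made up for illustration, not anything that
exists in yt. The timeout argument is how long a writer waits on
another process's lock before giving up.)

```python
import sqlite3
import time

# Hypothetical schema: the five fields from the current csv plus two
# "interesting" ones. None of these names come from yt itself.
SCHEMA = """
CREATE TABLE IF NOT EXISTS parameter_files (
    hash       TEXT PRIMARY KEY,
    name       TEXT NOT NULL,
    path       TEXT NOT NULL,
    last_seen  REAL NOT NULL,
    type       TEXT NOT NULL,
    redshift   REAL,
    resolution INTEGER
)
"""


def open_store(path):
    # Wait up to 30 seconds if another process holds the write lock.
    conn = sqlite3.connect(path, timeout=30.0)
    conn.execute(SCHEMA)
    return conn


def add_output(conn, pf_hash, name, path, pf_type,
               redshift=None, resolution=None):
    # "with conn" commits on success and rolls back on an exception.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO parameter_files "
            "VALUES (?, ?, ?, ?, ?, ?, ?)",
            (pf_hash, name, path, time.time(), pf_type,
             redshift, resolution),
        )


def find_by_redshift(conn, zmin, zmax):
    # Numeric columns make range queries and sorting straightforward.
    cur = conn.execute(
        "SELECT name FROM parameter_files "
        "WHERE redshift BETWEEN ? AND ? ORDER BY redshift",
        (zmin, zmax),
    )
    return [row[0] for row in cur]
```

(Adding a column later means an ALTER TABLE on every user's copy,
which is the "future changes a bit more difficult" cost mentioned
above.)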

Odd that you bring this up today!  This weekend I started work on a
project that ties into this.  It's in my repository "yt.hub" on
bitbucket, and it's a revamped hub -- unified pastebin, data pastebin
+ mapserver, and the project directory.  There is the possibility it
won't go anywhere, but I based it on Flask and I've got it supporting
mapserver of in-memory data components already.  I spent some time on
it today before preparing arxiv submissions, and I also created what I
am calling "minimal representations" in yt.  I have issued a pull
request for these:

https://bitbucket.org/yt_analysis/yt/pull-request/50/minimal-representation-of-objects

Here's the architecture I was aiming for, although I was hoping to
have a lot more done before sharing it with anybody.  Additionally,
the idea would be to stage these items.

 * Authentication of users: either through OpenID or local items
 * Pastebinning of data up to a given size
 * Storing, as a side effect of data pastebins, a mapping between
simulation data and simulation parameters, mediated by the 'minimal
representation'
 * Projects, like we have on the hub now
 * Pastes, like we have on the pastebin now

The idea I was exploring was to take an in-memory yt data object --
not the raw data, but rather a projection, a slice, a 2D or 1D phase
profile -- and upload that to a remote data repository.  One could
then go to the hub, view the available data, and spawn the mapserver,
or a plot fiddler, or something.  As a side effect of the fundamental
operations of the hub -- the data pastebin, the pastes, the project
dir -- we need to track simulation outputs.  And I think that's where
this plays in, and why I think we should have an identical data model
between the two.

As I mentioned above, I have this working for projections (I don't yet
have the multipart chunked upload going for uploading directly from
yt, but for projections manually inserted it works great).  The
minimal representation provides an easy way to both pass around an
object that can be pickled and unpickled (without having access to the
original data) and a mechanism for taking an object and converting it
to a POST request sent to an http server -- from the same underlying
object.

The columns inside the yt.hub right now mirror those stored in the
minimal representation.  So my vote would be that any local storage is
fine, as long as its columns are a superset of the attributes in the
MinimalStaticOutput.  Nathan's concern is a valid one, and I would be
loath to require that someone manage Lustre striping for fundamental
components of yt, but I am otherwise indifferent to the usage of
SQLite.  I would mainly like to emphasize that we try to focus on the
data, not the model.
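
To make the "same underlying object" idea concrete, here is a minimal
sketch of the pattern.  It is not the code in the pull request: the
class name, the attribute list, and the JSON body format are all
assumptions chosen for illustration.  It only builds the request
object; nothing is sent.

```python
import json
import pickle
import urllib.request


class MinimalOutput:
    # Hypothetical stand-in for a minimal representation; the
    # attribute names below are illustrative, not yt's.
    _attrs = ("name", "output_hash", "dimensionality", "current_time")

    def __init__(self, **kwargs):
        for attr in self._attrs:
            setattr(self, attr, kwargs.get(attr))

    def as_dict(self):
        return {attr: getattr(self, attr) for attr in self._attrs}

    def dumps(self):
        # Pickle only the plain attributes, so loading the result
        # never needs the original simulation data.
        return pickle.dumps(self.as_dict())

    @classmethod
    def loads(cls, blob):
        return cls(**pickle.loads(blob))

    def as_post_request(self, url):
        # Build (but do not send) a POST request from the same
        # attributes; JSON is an assumed wire format.
        body = json.dumps(self.as_dict()).encode("utf-8")
        return urllib.request.Request(
            url,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
```

Whatever columns the local store ends up with, the point is that both
the pickle and the POST body come from one attribute list, so the hub
and the local database cannot drift apart.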

>
> What do we all think?
>
> - Once we have the above settled, and working, I would like to extend
> the functionality to the cloud bzzzzzzz. Get it? It's a buzz word. So
> it buzzes. Thanks, I'll be here all week.

Your humor aside, there are very good reasons for hosting this in the
cloud.  The one that is most pressing, I feel, is that it allows for a
nice compartmentalization of a running server from everything else.
It's also potentially cheaper than buying hardware.

>
> There are three (four) ways to do this that I can think of
>
>  #1. Amazon Simple DB. The advantage of this is that it's offered
> free to all up to 1GB of storage and some reasonable limit of
> transactions per month. Each user sets up her own account on S3, and
> no one else has to be involved. But the main disadvantage is that it
> only supports storing things as strings, which makes numerical
> searches and sorts less useful, more annoying, and slower.
>  #1.5. Amazon Relational DB. This is not free at any level, but it
> offers all the usual DB functionality. Amazon does offer some
> educational grants, so we could apply for that. This service is
> targeted at usage levels that we will never reach, but if we get free
> time, that's fine. I think in this case (and the next two) user
> accounts on the database would have to be created for yt users by
> "us".
>  #2. Google App Engine. Free right now in pre-beta invitation-only
> phase. It will be similar to #1.5 above, as I understand things, and
> not be free forever. Personally, I seriously doubt that we'd get in on
> the pre-beta. I've looked at the application form [2] and I don't even
> understand one of the questions.
>  #3. Host a MySQL (or similar) database on one of our own servers
> (yt-project or similar). The advantage is that the cost should be no
> more than Matt is paying now. The disadvantage is, again, we have to
> set up accounts. Also, I don't know if Dreamhost (is that where
> yt-project is still?) allows open MySQL databases. Another advantage
> is that unlike #1.5 or #2 above, costs should never rise suddenly when
> an educational grant or beta period ends.
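
(On the strings-only limitation in #1: a small sketch of why numeric
sorts get annoying, and the usual zero-padding workaround.  The
encode_number helper and its width and precision are assumptions for
illustration, and the trick as written only works for non-negative
values.)

```python
redshifts = [10.0, 2.5, 0.75]

# Stored as plain strings, values compare character by character,
# so "10.0" sorts before "2.5".
naive = sorted(str(z) for z in redshifts)


def encode_number(z, width=10, places=4):
    # Hypothetical helper: fixed-width, zero-padded text so that
    # string order matches numeric order (non-negative values only).
    return f"{z:0{width}.{places}f}"


padded = sorted(encode_number(z) for z in redshifts)
decoded = [float(s) for s in padded]
```

(Every client then has to agree on the same width and precision,
which is the kind of bookkeeping a real numeric column avoids.)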

For the newly re-envisioned hub, I was thinking EC2 instances.

>
> Thanks for reading, and any and all comments are welcomed.
>
> [0] http://yt-project.org/doc/advanced/parallel_computation.html#parallelizing-your-analysis
> [1] http://vimeo.com/28797703
> [2] https://docs.google.com/spreadsheet/viewform?formkey=dHBwRmpHV2VicFVVNi1PaFBvUGgydXc6MA

(As a note, I find it pretty jarring to have to dig up your footnotes.
 It's way easier if you include them inline in the text, set off with
spaces or something.  Maybe I'm the only one ...)

>
> (*) There are issues with locks on parallel network file systems, but
> most home partitions on supercomputers are NFS (not something like
> Lustre) so this shouldn't be a problem.
>
> --
> Stephen Skory
> s at skory.us
> http://stephenskory.com/
> 510.621.3687 (google voice)