[yt-dev] sqlite parameter files storage

Nathan Goldbaum goldbaum at ucolick.org
Mon Dec 19 14:26:05 PST 2011


Hi Stephen,

One computer I use regularly (JuRoPa) does use Lustre for the home directory.

It would be great if any database solution you come up with works on a parallel filesystem.

Thanks,

Nathan


On Dec 19, 2011, at 2:14 PM, Stephen Skory wrote:

> Hey all (Matt, in particular),
> 
> I've got some dead time waiting for jobs to run so I'd like to A)
> discuss this topic and B) make some "final" decisions about this so I
> can go ahead and do some coding on this. Sorry about the length of
> this!
> 
> A) Briefly, for those of you who aren't aware of this topic, the idea
> is to replace the ~/.yt/parameter_files.csv text file with a SQLite
> database file. This has many advantages over a text file, too many to
> list here. But in particular, it has built-in locks for writes (*),
> which is especially useful for multi-level parallelism. This is
> something we're currently addressing in "official" examples with a
> kludge [0]. I think everyone is in agreement that this is a good
> thing, no?
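
[As a concrete sketch of the built-in locking being described here: Python's sqlite3 module retries writes while another process holds the database's write lock. The file name and columns below are assumptions for illustration, not yt's actual layout.]

```python
import sqlite3

# Hypothetical replacement for ~/.yt/parameter_files.csv; the path
# and schema are illustrative only.
conn = sqlite3.connect("parameter_files.db", timeout=30.0)
conn.execute(
    "CREATE TABLE IF NOT EXISTS parameter_files "
    "(hash TEXT PRIMARY KEY, name TEXT, location TEXT, "
    "last_seen REAL, type TEXT)"
)
# Each write runs inside a transaction.  If another process holds the
# write lock, sqlite3 retries for up to `timeout` seconds rather than
# corrupting the file -- no hand-rolled lock file needed.
with conn:
    conn.execute(
        "INSERT OR REPLACE INTO parameter_files VALUES (?, ?, ?, ?, ?)",
        ("abc123", "DD0040", "/data/run1/DD0040", 1324333565.0, "Enzo"),
    )
conn.close()
```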
> 
> The other big thing that this feeds into is a remote, centralized
> storage point for a clone of this database. I've discussed this idea
> before, sketched up a simple partially functional example, and made a
> simple video cast of how it works. [1]
> 
> B) The final decisions that I'd like input on are these.
> 
> - What data fields should we include in the databases? There are three
> ways to go with this.
>  #1. The same data that is in the current csv (basically: hash,
> name, location on disk, time, type). This is probably too few
> fields, so I think we can scratch it off immediately.
>  #2. Everything that can be gleaned from the dataset. This is
> actually practical to do, because the database is binary and
> searchable. However, because the fields in various datasets are so
> different, this could result in a fairly unwieldy database with (in a
> Chris Traeger voice) a literal ton of columns. This could be mitigated
> by having a different database table for each type of dataset (Enzo,
> Athena, etc...), but that really only swaps one kind of complexity for
> another.
>  #3. A minimal set of "interesting" fields (redshift, box resolution,
> cosmological parameters, etc.). This is more attractive than #2 in
> that it's very unlikely anyone will want to search over every field in
> a dataset, so it keeps things more streamlined. But then we have to
> agree on a reasonable set of parameters to include, and it makes
> future changes a bit more difficult.
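
[One way option #3 might look: the current csv's identification fields plus a small, typed set of "interesting" parameters. The column choices here are only a suggestion, not a proposed final schema.]

```python
import sqlite3

# Sketch of an "option #3" schema: core identification fields plus a
# small agreed-upon set of typed, searchable parameters.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE parameter_files (
        hash TEXT PRIMARY KEY,
        name TEXT,
        location TEXT,
        last_seen REAL,
        type TEXT,               -- Enzo, Athena, ...
        current_redshift REAL,
        domain_dimensions TEXT,  -- e.g. "64 64 64"
        omega_matter REAL,
        omega_lambda REAL,
        hubble_constant REAL
    )
""")
conn.execute(
    "INSERT INTO parameter_files VALUES (?,?,?,?,?,?,?,?,?,?)",
    ("abc123", "DD0040", "/data/run1/DD0040", 1324333565.0, "Enzo",
     2.5, "64 64 64", 0.27, 0.73, 0.71),
)
# Because the columns are typed, numerical searches work directly,
# which is what the csv (and a string-only store) can't give us:
rows = conn.execute(
    "SELECT name FROM parameter_files WHERE current_redshift > 2.0"
).fetchall()
```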
> 
> What do we all think?
> 
> - Once we have the above settled, and working, I would like to extend
> the functionality to the cloud bzzzzzzz. Get it? It's a buzz word. So
> it buzzes. Thanks, I'll be here all week.
> 
> There are three (four) ways to do this that I can think of:
> 
>  #1. Amazon Simple DB. The advantage of this is that it's offered
> free to all up to 1GB of storage and some reasonable limit of
> transactions per month. Each user sets up her own account on S3, and
> no one else has to be involved. But the main disadvantage is that it
> only supports storing things as strings, which makes numerical
> searches and sorts less useful, more annoying, and slower.
>  #1.5. Amazon Relational DB. This is not free at any level, but it
> offers all the usual DB functionality. Amazon does offer some
> educational grants, so we could apply for that. This service is
> targeted at usage levels that we will never reach, but if we get free
> time, that's fine. I think in this case (and the next two) user
> accounts on the database would have to be created for yt users by
> "us".
>  #2. Google App Engine. Free right now in pre-beta invitation-only
> phase. It will be similar to #1.5 above, as I understand things, and
> not be free forever. Personally, I seriously doubt that we'd get in on
> the pre-beta. I've looked at the application form [2] and I don't even
> understand one of the questions.
>  #3. Host a MySQL (or similar) database on one of our own servers
> (yt-project or similar). The advantage is that the cost should be no
> more than what Matt is paying now. The disadvantage is, again, we have to
> set up accounts. Also, I don't know if Dreamhost (is that where
> yt-project is still?) allows open MySQL databases. Another advantage
> is that unlike #1.5 or #2 above, costs should never rise suddenly when
> an educational grant or beta period ends.
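
[To make the string-only disadvantage of #1 concrete: SimpleDB compares attribute values lexicographically, so numeric sorts and range queries come out wrong unless every value is zero-padded by hand. The same effect is easy to see with plain Python string sorting.]

```python
# Redshift values stored as strings, as SimpleDB would hold them.
zs = ["10.0", "9.5", "2.5"]

# Lexicographic ordering puts "10.0" first because '1' < '2' < '9'.
lex = sorted(zs)           # ['10.0', '2.5', '9.5'] -- wrong numerically

# A real numeric sort requires converting back out of strings,
# which is the extra annoyance (and cost) of a string-only store.
num = sorted(zs, key=float)  # ['2.5', '9.5', '10.0'] -- correct
```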
> 
> Thanks for reading, and any and all comments are welcomed.
> 
> [0] http://yt-project.org/doc/advanced/parallel_computation.html#parallelizing-your-analysis
> [1] http://vimeo.com/28797703
> [2] https://docs.google.com/spreadsheet/viewform?formkey=dHBwRmpHV2VicFVVNi1PaFBvUGgydXc6MA
> 
> (*) There are issues with locks on parallel network file systems, but
> most home partitions on supercomputers are NFS (not something like
> Lustre) so this shouldn't be a problem.
> 
> -- 
> Stephen Skory
> s at skory.us
> http://stephenskory.com/
> 510.621.3687 (google voice)
> _______________________________________________
> yt-dev mailing list
> yt-dev at lists.spacepope.org
> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
> 




More information about the yt-dev mailing list