[Yt-dev] Time series data

Matthew Turk matthewturk at gmail.com
Fri Aug 21 09:48:38 PDT 2009


Hi everyone,

Stephen and I are at the SciPy 2009 workshop.  I saw a talk yesterday
on Time Series analysis of neuroimaging data (all the SciPy talks are
online here: http://www.archive.org/search.php?query=scipy but this is
the talk in question:
http://www.archive.org/details/scipy09_day1_04-Ariel_Rokem and Peter
Norvig's is especially good:
http://www.archive.org/details/scipy09_day1_03-Peter_Norvig ) and got
really jazzed up about time series analysis.

So, starting out initially on my own (and later to incorporate
Britton's EnzoSimulation code) I've written up a first pass at a time
series analysis framework.  This, as with all the experimental stuff
that's not ready for trunk, is in the hg repo in the branch
timeseries.

Right now it works as follows -- you instantiate an EnzoTimeSeries,
with an optional name and an output log.  (Currently the entire thing
is populated via the "OutputLog" which is in Enzo 1.5 and beyond.)  It
reads in all the parameter files and instantiates them (see below for
why this is not awesome) and then provides some mechanisms for getting
data about and from them.
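To make the population step concrete, here's a hedged sketch of what parsing an "OutputLog" into a list of parameter files might look like.  The actual OutputLog format may differ from what's assumed here (whitespace-separated lines whose third field is the parameter file path); this is illustrative, not the real implementation.

```python
# Hypothetical sketch of populating a time series from an OutputLog.
# Assumed line format: "DATASET WRITTEN <path> <time>" -- the real
# Enzo OutputLog layout may differ.
def parse_output_log(text):
    """Return the parameter-file paths named in an OutputLog."""
    paths = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "DATASET":
            paths.append(fields[2])
    return paths

sample = """DATASET WRITTEN DD0000/data0000 0.0
DATASET WRITTEN DD0001/data0001 1.0"""
print(parse_output_log(sample))  # ['DD0000/data0000', 'DD0001/data0001']
```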

The architecture is separated into three main sections -- time series
objects, data objects, and analysis tasks.  The first two shouldn't
ever really have to be written by the user; anything that exists as a
data object in yt already (spheres, regions, etc.) is
automatically converted into a time-series data object.  The analysis
tasks are a bit more user-approachable.  You can see what's already
been created here:

http://hg.enzotools.org/yt/file/timeseries/yt/data_objects/analyzer_objects.py

For example, the MaximumValue task:

class MaximumValue(AnalysisTask):
    _params = ['field']

    def eval(self, data_object):
        v = data_object.quantities["MaxLocation"](
                self.field, lazy_reader=True)[0]
        return v

This declares that the task accepts a 'field' parameter, and then
inside 'eval' it returns the appropriate value.  (Note that the time
series framework introspects the 'eval' method, so if it accepts a
'pf' argument instead of 'data_object', it knows what to do.)
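Here's a minimal sketch of what that introspection-based dispatch could look like: the driver inspects the name of eval's first real argument and hands the task either the parameter file or the data object.  The class and function names here are illustrative stand-ins, not yt's actual implementation.

```python
import inspect

# Illustrative base class: map positional args onto declared _params.
class AnalysisTask:
    _params = []
    def __init__(self, *args):
        for name, value in zip(self._params, args):
            setattr(self, name, value)

class MaxDensity(AnalysisTask):
    _params = ["field"]
    def eval(self, data_object):
        return max(data_object[self.field])

def run_task(task, pf, data_object):
    # Inspect eval's signature; args[0] is 'self', args[1] tells us
    # whether the task wants the parameter file or a data object.
    argname = inspect.getfullargspec(task.eval).args[1]
    source = pf if argname == "pf" else data_object
    return task.eval(source)

print(run_task(MaxDensity("Density"), None, {"Density": [1.0, 5.0, 3.0]}))  # 5.0
```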

So, you might say, whatever, right?  How would I use this?  Well, I've
placed an example file here:

http://hg.enzotools.org/test_scripts/file/tip/time_series.py

from yt.mods import *
from yt.data_objects import *

ts = EnzoTimeSeries("OutputLog")
ret1 = ts.eval([MaximumValue("Density"),
               SlicePlotDataset("Density", 0, [0.5, 0.5, 0.5])
              ])
sp = ts.sphere([0.5, 0.5, 0.5], 1.0)
ret2 = sp.eval([MaximumValue("Density"), CurrentTimeYears()])

So let's examine this one piece at a time.

The first line:

ts = EnzoTimeSeries("OutputLog")

initializes the time series.  (Here I've actually mixed up the
arguments: "OutputLog" is being passed as the name, and the location
of the OutputLog itself falls back to its default value, which happens
to also be ... "OutputLog".)

The next line:

ret1 = ts.eval([MaximumValue("Density"),
               SlicePlotDataset("Density", 0, [0.5, 0.5, 0.5])
              ])

creates two tasks, MaximumValue and SlicePlotDataset, each with its
own arguments.  MaximumValue is given "Density", and SlicePlotDataset
is given "Density", the axis, and the center.  These are handed to the
time series, which is told to go ahead and evaluate them.  It does,
and it returns a list of lists of values as ret1 -- the first being
the max values, the second being the slice filenames.

Okay, now, if we keep going:

sp = ts.sphere([0.5, 0.5, 0.5], 1.0)
ret2 = sp.eval([MaximumValue("Density"), CurrentTimeYears()])

The first line here constructs a 'time series data operator' -- a
sphere.  The sphere is centered at (0.5, 0.5, 0.5) and has a radius of
1.0.  We can then call eval on it, just like we did on the time series
object, and it creates and operates on that sphere at every step in
the time series.
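The evaluation loop implied here can be sketched in a few lines: for each output in the time series, materialize the data object (a fake "sphere" below) and run every task on it, collecting one list of results per task.  This is a toy model of the idea, under assumed names, not the framework's real code.

```python
# Minimal sketch of a time-series eval loop over a data operator.
def evaluate(outputs, make_object, tasks):
    """Run each task on the data object built from each output."""
    results = [[] for _ in tasks]
    for pf in outputs:
        obj = make_object(pf)  # e.g. construct the sphere at this step
        for i, task in enumerate(tasks):
            results[i].append(task(obj))
    return results

outputs = [{"Density": [1.0, 2.0]}, {"Density": [3.0, 4.0]}]
make_sphere = lambda pf: pf  # stand-in for building a real sphere
print(evaluate(outputs, make_sphere, [lambda o: max(o["Density"])]))
# [[2.0, 4.0]]
```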

This is basically what I was hoping to move toward with yt-2.0 -- a
disconnected operator/data model.  The framework isn't sufficient yet
to do this, but it's getting there.  I'll be merging in Britton's work
with EnzoSimulation for grabbing and selecting cosmology information.
Additionally, I'll be splitting out the reliance on OutputLog into a
generalized time series creator, which CAN but NEEDN'T accept
OutputLog.  The population step and the time series will be separate,
and Britton's got a mechanism for populating from an input parameter
file.

Shortcomings:

  * All the hierarchies stick around because the parameter files are
stored.  This won't work for large datasets, as it'll quickly blow out
the RAM.  I'm working on a workaround that will also handle slicing,
etc.
  * The plotting isn't done yet, but it will be.
  * Everything is in lists right now, because the return values need
not be floating point, but can be strings, objects, 3D arrays,
whatever.
  * Some basic tasks should be built in -- mostly things like
"filename" and "time" and "redshift" which you should be able to just
access and grab.
  * Storing results of evaluation tasks is also not clear to me yet.
Recarrays would be great for fixed-size items, but not objects.  Maybe
the user should just have the option to eval-and-store.
  * Adding tasks requires some object instantiation, and usage of
class names.  Do y'all have any thoughts on this?  Should we move to
the same paradigm used in the callback method, where a plaintext
addressor name is used?  (e.g. modify["grids"] instead of
add_callback(GridBoundaryCallback())...)
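By analogy with the callback registry, the plaintext-addressor idea could look something like the following: tasks register under a string name, and users add them by name instead of instantiating classes directly.  Purely illustrative; none of these names exist in yt.

```python
# Hypothetical string-keyed task registry, mirroring the callback
# paradigm mentioned above.
task_registry = {}

def register_task(name):
    """Class decorator that files a task under a plaintext name."""
    def decorator(cls):
        task_registry[name] = cls
        return cls
    return decorator

@register_task("maximum_value")
class MaximumValue:
    def __init__(self, field):
        self.field = field

# Users would then address tasks by string rather than by class:
task = task_registry["maximum_value"]("Density")
print(task.field)  # Density
```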

I really feel like moving toward time series analysis is the next
major step, and I'd like to continue working on this, as I'll be using
it more and more in the future.  Additionally, I think we could move
toward something of a pipeline, as well -- stringing together inputs
and outputs.  Something like:

my_pipeline = TaskPipeline(MaximumValueLocation("Density"), VirialRadius(), ...)
ts.eval(my_pipeline)

and then have the outputs of tasks fed from one to the next.
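A minimal sketch of that chaining idea, with placeholder stage functions standing in for real tasks like MaximumValueLocation and VirialRadius: each stage receives the previous stage's output.

```python
# Toy TaskPipeline: feed each stage's output into the next stage.
class TaskPipeline:
    def __init__(self, *stages):
        self.stages = stages
    def eval(self, value):
        for stage in self.stages:
            value = stage(value)
        return value

# Placeholder stages: find the max, then do something with it.
pipeline = TaskPipeline(lambda data: max(data), lambda v: v * 2)
print(pipeline.eval([1, 4, 2]))  # 8
```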

Anyway, what do you all think?

-Matt

PS The reorganization should happen around the time we focus on time
series stuff -- which is why it's under data_objects.
