[yt-dev] Proposal: Upcast Enzo to 64 bits at IO time

Thu Dec 6 11:44:15 PST 2012

Pardon my ignorance, but is it the case that computations done in 64-bit mode in Enzo are normally saved to disk as 32-bit floats?  If so, is there a setting I can change to make sure that my Enzo datasets are always written to disk in double precision?

Since most Enzo calculations are done in 64 bit anyway and this change allows some pretty significant speedups, I'm +1 on this change.

On Dec 6, 2012, at 11:30 AM, Matthew Turk wrote:

> Hi all,
> 
> I've been doing some benchmarking of various operations in the Enzo
> frontend in yt 2.x.  I don't believe other frontends suffer from this,
> for the main reason that they're all 64 bit everywhere.
> 
> The test dataset is about ten gigs, with a bunch of grids.  I'm
> extracting a surface, which means from a practical standpoint that I'm
> filling ghost zones for every grid inside the region of interest.
> There are many places in yt where we either upcast to 64-bit floats or
> assume 64 bits.  Basically, nearly all yt-defined Cython or C
> operations assume 64-bit floats.
> 
> There's a large quantity of Enzo data out there that is float32 on
> disk, which gets passed into yt, where it gets handed around until it
> is upcast.  There are two problems here: 1) We have a tendency to use
> "astype" instead of "asarray", which means the data is *always*
> duplicated.  2) We often do this repeatedly for the same set of grid
> data; nowhere is this more true than when generating ghost zones.
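
To make the copy behaviour described above concrete, here is a minimal NumPy-only sketch. The arrays are synthetic stand-ins for one grid's field data, not anything read through yt, and the variable names are made up for illustration.

```python
import numpy as np

# Stand-in for one grid's field data as read from a float32 Enzo file.
data32 = np.ones((16, 16, 16), dtype="float32")
data64 = data32.astype("float64")

# astype copies by default, even when the dtype already matches.
copied = data64.astype("float64")
print(np.shares_memory(copied, data64))

# asarray hands back the input itself when no conversion is needed.
same = np.asarray(data64, dtype="float64")
print(same is data64)

# With float32 input, a new float64 array has to be allocated either way.
upcast = np.asarray(data32, dtype="float64")
print(upcast.dtype, np.shares_memory(upcast, data32))
```

So the avoidable cost is the copy of data that is already float64; a genuine float32-to-float64 conversion still has to allocate.
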
> 
> So for the dataset I've been working on, ghost zones are a really
> intense prospect.  And the call to .astype("float64") actually
> completely dominated the operation.  This comes from both copying the
> data and casting it.  I found two different solutions.
> 
> The original code:
> 
> g_fields = [grid[field].astype("float64") for field in fields]
> 
> This is bad even if you're using float64 data types, since it will
> always copy.  So it has to go.  The total runtime for this dataset was
> 160s, and the most-expensive function was "astype" at 53 seconds.
> 
> So as a first step, I inserted a cast to "float64" if the dtype of an
> array inside the Enzo IO system was "float32".  This way, all arrays
> were upcast automatically.  This led me to see zero performance
> improvement.  So I checked further and saw the "always copy" bit in
> astype, which I was ignorant of.  This option:
> 
> g_fields = [np.asarray(grid[field], "float64") for field in fields]
> 
> is much faster, and saves a bunch of time.  But 7 seconds is still
> spent inside "np.array", and total runtime is 107.5 seconds.  This
> option is the fastest:
> 
>        g_fields = []
>        for field in fields:
>            gf = grid[field]
>            if gf.dtype != "float64": gf = gf.astype("float64")
>            g_fields.append(gf)
> 
> and now total runtime is 95.6 seconds, with the dominant cost *still*
> in _get_data_from_grid.  At this point I am much happier with the
> performance, although still quite disappointed, and I'll be doing
> line-by-line profiling next to figure out any more micro-optimizations.
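
For anyone who wants to compare the three variants outside of yt, the following is a rough sketch of a timing harness. Everything in it is an assumption for illustration: the `fields` dict is a synthetic stand-in for a grid, the helper name `as_float64` is hypothetical, and the timings will not match the numbers quoted above, which came from a real dataset.

```python
import timeit
import numpy as np

def as_float64(arr):
    """Return arr as float64, copying only when a conversion is needed."""
    if arr.dtype != np.float64:
        return arr.astype("float64")
    return arr

# Hypothetical stand-in for the fields of one grid; already float64.
fields = {"Density": np.random.random((32, 32, 32)),
          "Temperature": np.random.random((32, 32, 32))}

def always_copy():
    # Original code: astype copies every time.
    return [fields[f].astype("float64") for f in fields]

def via_asarray():
    # asarray skips the copy when the dtype already matches.
    return [np.asarray(fields[f], "float64") for f in fields]

def cast_if_needed():
    # Explicit dtype check before converting.
    return [as_float64(fields[f]) for f in fields]

for fn in (always_copy, via_asarray, cast_if_needed):
    elapsed = timeit.timeit(fn, number=200)
    print(f"{fn.__name__}: {elapsed:.4f} s")
```

Newer NumPy releases also accept `arr.astype("float64", copy=False)`, which returns the input unchanged when no conversion is required.
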
> 
> Now, the change to _get_data_from_grid *itself* will greatly improve
> performance for 64-bit datasets.  But also updating io.py to upcast
> 32-bit datasets on read will speed things up considerably for 32-bit
> datasets as well.  The downside is that it will be difficult to get
> back raw, unmodified 32-bit data from the grids, rather than 32-bit
> data that has been cast to 64 bits.
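
One way to picture the upcast-on-read idea and its trade-off is the sketch below. It is not the actual yt io.py code: `read_field` and its `keep_native` flag are hypothetical names, and the input array stands in for whatever the real reader returns from disk.

```python
import numpy as np

def read_field(raw, keep_native=False):
    """Hypothetical reader hook: raw is the array as stored on disk."""
    if keep_native or raw.dtype != np.float32:
        return raw
    # One conversion, paid once at read time instead of at every use.
    return raw.astype("float64")

on_disk = np.arange(8, dtype="float32")
field = read_field(on_disk)
print(field.dtype)
print(read_field(on_disk, keep_native=True).dtype)

# Casting the upcast array back down recovers the original float32 values,
# since every float32 value is exactly representable as a float64.
print(np.array_equal(field.astype("float32"), on_disk))
```

An opt-out flag like this is only one possible answer to the downside mentioned above; casting back down after the fact also recovers the values, at the cost of another copy.
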
> 
> Is this an okay change to make?
> 
> [+-1][01]
> 
> -Matt