[yt-dev] Proposal: Upcast Enzo to 64 bits at IO time
Matthew Turk
matthewturk at gmail.com
Thu Dec 6 13:01:51 PST 2012
Yup, that's right.
On Thu, Dec 6, 2012 at 3:16 PM, Cameron Hummels <chummels at gmail.com> wrote:
> To confirm, you're saying that we should avoid using astype() in future code
> and replace it with asarray instead, and that we don't need to use your list
> comprehension step in our general usage, just in this particular case?
>
> Cameron
>
>
> On Thu, Dec 6, 2012 at 1:12 PM, Matthew Turk <matthewturk at gmail.com> wrote:
>>
>> Hi Cameron and everyone else,
>>
>> Let me clarify one point -- the changes to _get_data_from_grid
>> preserve neutrality on the part of yt about the type of data that is
>> being fed to it, either 32 bit of 64. But, the new version optimizes
>> considerably for the case where the data is not 64-bit. In response
>> to Doug's reply, this particular change is designed to allow other
>> frontends to continue to supply 32-bit data. A change to the
>> GridPatch object itself, making sure that when data gets sent into the
>> field_data dict is 64-bit rather than 32-bit, would eliminate the need
>> for neutrality in _get_data_from_grid in the SmoothedCoveringGrid
>> object. So this still preserves neutrality. Were we to move away
>> from neutrality, we'd see further speedup. The change to the way
>> astype() was being called is going through regardless of the change to
>> the Enzo frontend. The Enzo frontend change is just to ensure that
>> for a particular subset of Enzo data, we see maximum speedup.
>>
>> Cameron, in response to your suggestion, yes, we absolutely should
>> avoid usage of astype() except where copies are explicitly needed.
>> The piece of information that this copies was in my head at some
>> point, but I had forgotten it. The right function *does* exist, and
>> it is np.asarray, but that carried with it more overhead than my
>> (one-off) solution, which I've now also wrappedi nto a list
>> comprehension for another two seconds of savings. It will be worth
>> going through the code base and looking at all of these usages of
>> astype and seeing which can be replaced. One of the reasons this
>> particular usage was so problematic is that it was inside a tightloop,
>> which itself was in a tightloop, and inside another loop (not so
>> tight) and astype() was being called on the *same* arrays over and
>> over. So it was pathologically bad. I've optimized a bit more and I
>> think at this point we're at an algorithmic impasse, which is a
>> subject for another day.
>>
>> Anybody up for an optimization hangout/sprint, we could probably come
>> up with benchmarks and get some changes out pretty quick for
>> performance.
>>
>> -Matt
>>
>> On Thu, Dec 6, 2012 at 3:01 PM, Cameron Hummels <chummels at gmail.com>
>> wrote:
>> > Hey Matt,
>> >
>> > Thanks for optimizing the code--I know it isn't a glorious job, but it
>> > will
>> > help us all in the end. I think your change looks great, but I'm
>> > curious:
>> > is this the sort of operation that is going to consistently come up in
>> > the
>> > future, or a one-off change you're making in the enzo front end? If it
>> > is
>> > something we should be using for future casting, perhaps we should make
>> > a
>> > helper function to do this, so that everyone can consistently use this
>> > optimized code instead of the more common .astype cast. Other than
>> > that,
>> > I'm +1 on this change.
>> >
>> > Cameron
>> >
>> > On Thu, Dec 6, 2012 at 12:50 PM, Matthew Turk <matthewturk at gmail.com>
>> > wrote:
>> >>
>> >> On Thu, Dec 6, 2012 at 2:44 PM, Nathan Goldbaum <nathan12343 at gmail.com>
>> >> wrote:
>> >> > Pardon my ignorance, but is the case that computations done in 64 bit
>> >> > mode in enzo are normally saved to disk as 32 bit floats? If so, is
>> >> > there a
>> >> > setting I can change to make sure that my enzo datasets are always
>> >> > written
>> >> > to disk with double precision?
>> >>
>> >> I disabled that particular anti-feature some time ago, with the
>> >> "New_Grid_WriteGrid.C" stuff. Now in Enzo, you write to disk exactly
>> >> what you store in memory.
>> >>
>> >> >
>> >> > Since most enzo calculations are done in 64 bit anyway and this
>> >> > change
>> >> > allows some pretty significant speedups, I'm +1 on this change.
>> >> >
>> >> > On Dec 6, 2012, at 11:30 AM, Matthew Turk wrote:
>> >> >
>> >> >> Hi all,
>> >> >>
>> >> >> I've been doing some benchmarking of various operations in the Enzo
>> >> >> frontend in yt 2.x. I don't believe other frontends suffer from
>> >> >> this,
>> >> >> for the main reason that they're all 64 bit everywhere.
>> >> >>
>> >> >> The test dataset is about ten gigs, with a bunch of grids. I'm
>> >> >> extracting a surface, which means from a practical standpoint that
>> >> >> I'm
>> >> >> filling ghost zones for every grid inside the region of interest.
>> >> >> There are many places in yt that we either upcast to 64-bit floats
>> >> >> or
>> >> >> that we assume 64-bits. Basically, nearly all yt-defined Cython or
>> >> >> C
>> >> >> operations assume 64-bit floats.
>> >> >>
>> >> >> There's a large quantity of Enzo data out there that is float32 on
>> >> >> disk, which gets passed into yt, where it gets handed around until
>> >> >> it
>> >> >> is upcast. There are two problems here: 1) We have a tendency to
>> >> >> use
>> >> >> "astype" instead of "asarray", which means the data is *always*
>> >> >> duplicated. 2) We often do this repeatedly for the same set of grid
>> >> >> data; nowhere is this more true than when generating ghost zones.
>> >> >>
>> >> >> So for the dataset I've been working on, ghost zones are a really
>> >> >> intense prospect. And the call to .astype("float64") actually
>> >> >> completely dominated the operation. This comes from both copying
>> >> >> the
>> >> >> data, as well as casting. I found two different solutions.
>> >> >>
>> >> >> The original code:
>> >> >>
>> >> >> g_fields = [grid[field].astype("float64") for field in fields]
>> >> >>
>> >> >> This is bad even if you're using float64 data types, since it will
>> >> >> always copy. So it has to go. The total runtime for this dataset
>> >> >> was
>> >> >> 160s, and the most-expensive function was "astype" at 53 seconds.
>> >> >>
>> >> >> So as a first step, I inserted a cast to "float64" if the dtype of
>> >> >> an
>> >> >> array inside the Enzo IO system was "float32". This way, all arrays
>> >> >> were upcast automatically. This led me to see zero performance
>> >> >> improvement. So I checked further and saw the "always copy" bit in
>> >> >> astype, which I was ignorant of. This option:
>> >> >>
>> >> >> g_fields = [np.asarray(grid[field], "float64") for field in fields]
>> >> >>
>> >> >> is much faster, and saves a bunch of time. But 7 seconds is still
>> >> >> spent inside "np.array", and total runtime is 107.5 seconds. This
>> >> >> option is the fasted:
>> >> >>
>> >> >> g_fields = []
>> >> >> for field in fields:
>> >> >> gf = grid[field]
>> >> >> if gf.dtype != "float64": gf = gf.astype("float64")
>> >> >> g_fields.append(gf)
>> >> >>
>> >> >> and now total runtime is 95.6 seconds, with the dominant cost
>> >> >> *still*
>> >> >> in _get_data_from_grid. At this point I am much more happy with the
>> >> >> performance, although still quite disappointed, and I'll be doing
>> >> >> line-by-line next to figure out any more micro-optimizations.
>> >> >>
>> >> >> Now, the change to _get_data_from_grid *itself* will greatly impact
>> >> >> performance for 64-bit datasets. But also updating the io.py to
>> >> >> upcast-on-read datasets that are 32-bit will help speed things up
>> >> >> considerably for 32-bit datasets as well. The downside is that it
>> >> >> will be difficult to get back raw, unmodified 32-bit data from the
>> >> >> grids, rather than 32-bit data that has been cast to 64-bits.
>> >> >>
>> >> >> Is this an okay change to make?
>> >> >>
>> >> >> [+-1][01]
>> >> >>
>> >> >> -Matt
>> >> >> _______________________________________________
>> >> >> yt-dev mailing list
>> >> >> yt-dev at lists.spacepope.org
>> >> >> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
>> >> >
>> >> > _______________________________________________
>> >> > yt-dev mailing list
>> >> > yt-dev at lists.spacepope.org
>> >> > http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
>> >> _______________________________________________
>> >> yt-dev mailing list
>> >> yt-dev at lists.spacepope.org
>> >> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
>> >
>> >
>> >
>> > _______________________________________________
>> > yt-dev mailing list
>> > yt-dev at lists.spacepope.org
>> > http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
>> >
>> _______________________________________________
>> yt-dev mailing list
>> yt-dev at lists.spacepope.org
>> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
>
>
>
> _______________________________________________
> yt-dev mailing list
> yt-dev at lists.spacepope.org
> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
>
More information about the yt-dev
mailing list