[Yt-dev] Fwd: ellipsoid part 2

Matthew Turk matthewturk at gmail.com
Sat Sep 10 04:45:35 PDT 2011


Hi Geoffrey,

Let me start with the end of the story: the basic code is ready for
inclusion, so if you'd like, make the changes to your bitbucket repo
and issue a pull request and we can bring it in.  Thank you so much
for your efforts!  (And please also feel free to add your name to the
CREDITS file when you do so.  :)

There were a couple things you mentioned so I'll take the time to
reply to those, as well.

On Fri, Sep 9, 2011 at 10:13 PM, Geoffrey So <gsiisg at gmail.com> wrote:
> Matt, Stephen,
> Thanks Matt for all the suggestions.  I've made most of the changes, but
> some I wasn't quite sure what you meant:
> 1) Use booleans for parameters such as use_halos.  True and False, not 0 and
> 1.
> - Done
> 2) Iterate over the halos:
> - Done (I assume this is faster because it doesn't try to find which halo
> every time I call it?)

It's just more ... "pythonic" if you will.  It does speed things up
marginally for the reason you note, but it's also conceptually a bit
cleaner.

> 3) Index variables should be clearly index variables; "halo" is an
> object, but you use it as an index.
> - Done, switched to calling halo_num to avoid ambiguity

Cool.  One fun thing you can do is avoid having index variables at all
by using the python function enumerate:

for halo_num, halo in enumerate(halos):
    ...

This will automatically give you an index, halo_num.  You don't need
to worry about it here, though.

> 4) cm => com
> - Done
> 5) You set ax to be an array twice, which will double the memory
> requirements.  I don't think ax needs to be an array at all.
> - Switched to using list [], not sure why an array of arrays is better than
> list of arrays, but I take your word for it

This came up a couple times in your email, so I'll address it here.
The issue comes down to 'when is an array useful'.  Typically you want
to use them when you are performing some operation in lockstep --
really, an array operation.  However, there is a cost to creating
them.  For instance, when you execute this operation:

my_array = na.array([ arr1, arr2, arr3 ])

You are implicitly performing the following three actions in sequence:

1) Creating a list, with three components: arr1, arr2, arr3.
2) Calling the na.array function, which will allocate space for an
array that is of shape (3, arr1.size).
3) Copying the elements of arr1, arr2, arr3 into that new array, and
binding it to my_array.

So at the end you have consumed at least twice the memory.  The first
operation will use pointers to the underlying objects; the last one
will not.  Now, if you take it one step further:

my_array = na.array([arr1**2, arr2**2, arr3**2])

You add in some additional steps:

1) Creating a temporary array of size arr1.size and filling it with arr1**2
2) Same, for arr2
3) Same, for arr3
4) Creating a list
5) Allocating (3, arr1.size)
6) Copying
7) De-allocating the temporary arrays

So we have now *tripled* our memory at peak usage.  Nominally this is
not a problem; however, if you do it many times with very large
arrays, it will slow things down.

Returning to the question of when you want to use arrays: use them
when you are doing something that *uses* the array semantics.  For
instance, you need a 3D array if you want to do fast sums along an
axis, or very fast iteration along all axes, or something like that.
But in the code you presented, this isn't really done: almost
everything is done one column at a time.  So the cost, in time and
peak memory usage, is not *necessarily* worth it.  That being said,
it's a small point.
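To make that concrete, here's a small sketch (the arrays are just
stand-ins, not actual yt data): column-at-a-time work needs no copy at
all, and you only pay for the (3, N) allocation when you genuinely
want a lockstep array operation.

```python
import numpy as na  # yt's historical alias for numpy

# Hypothetical stand-ins for three coordinate columns:
arr1 = na.arange(5.0)
arr2 = na.arange(5.0) + 10.0
arr3 = na.arange(5.0) + 20.0

# Column-at-a-time work: a plain list of the existing arrays is
# enough, and no data is copied.
cols = [arr1, arr2, arr3]
col_sums = [c.sum() for c in cols]

# Only build the (3, N) array when you need array semantics, e.g. a
# lockstep sum across the axis in a single call:
stacked = na.array(cols)         # allocates (3, N) and copies
axis_sum = stacked.sum(axis=0)   # fast operation over all columns
```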

> 6) Line 109, just modify ax in place.  Fancy-indexing in numpy will (I
> believe) implicitly make a copy when you modify:
>
> In [1]: import numpy as na
> In [2]: arr = na.random.random(100)
> In [3]: ai = arr > 0.8
> In [4]: bb = arr[ai]
> In [5]: bb -= 1.0
> In [6]: arr.max()
> Out[6]: 0.99642788644174751
> In [7]: bb.max()
> Out[7]: -0.0035721135582524877
>
> - Not quite sure what was meant by the example

Ah, just that fancy indexing makes a copy on the backend, so when you
modify the result you are modifying the copy, not the original array.
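In other words, a minimal sketch of the same session as the In/Out
example above:

```python
import numpy as na

arr = na.array([0.1, 0.5, 0.9, 0.95])
ai = arr > 0.8

# Binding the fancy-indexed result to a name makes a copy, so
# modifying bb leaves arr untouched:
bb = arr[ai]
bb -= 1.0

# Assigning through the index instead writes back into arr in place:
arr[ai] -= 1.0
```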

> 7) Line 113, use range or xrange, and use singular "axis" not "axes"
> - Done
> 8) Line 114, you double the memory by making a new array again.
> - Done, switched to list
> 9) Line 117, abs is a numpy function.
> - Done, switched to na.abs()
> 10) Line 121, this makes four new arrays, each of which is the length
> of the number of particles.  Construct a temporary array and do
> in-place math.  You don't need "where" here; use booleans.
> - Old code left from my using copy and paste from the testing phase.
> Most things if used only once I avoid using memory by calling it a name
> 11) Your calculation of the inertial tensor is nice, but I would again
> recommend that you avoid creating four temporary arrays when
> calculating the magnitude of 'ax'.
> - Not changed, since this is the fastest way I can think of.  Since I
> realized the memory requirement of particles is much less than fields,
> memory should not be the main concern here but speed.  But I'm all
> ears if you or Stephen have a faster/efficient way.

Nah, this is fine.

> 12) I would encourage you to investigate the eigenvector code if you
> suspect it gives LH system coordinates.
> - Not done
> 13) "for iter in range(0, maxiterations)" should just be for iter in
> xrange(maxiterations).  There is no need to specify 0.
> - Done
> 14) Please use print without parentheses.  Python 3 has transformed
> print into a function, but the formatting rules are also different.
> - Done
> 15) Avoid using where!
> - Done
> 16) use .size instead of len() for arrays
> - Done
> 17) Feel free to throw an error.  Define a subclass of Exception and
> then raise an instance of it.
> - Will look into how to write an error class in YT..., very useful I think,
> but I want to see if the changes I have so far are adequate.

They're adequate.  :)  We can add in an error at the end after it
comes in.  Time to bring the code into yt and test it out in the wild.
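For later reference, the error in item 17 could look something like
this; the class name, message, and helper function are hypothetical,
not an existing yt exception:

```python
# Hypothetical sketch of item 17: subclass Exception, then raise an
# instance of it when the fit fails.
class NoConvergenceError(Exception):
    """Raised when the ellipsoid fit runs out of particles."""
    def __init__(self, halo_num):
        self.halo_num = halo_num

    def __str__(self):
        return "Ellipsoid fit did not converge for halo %d" % self.halo_num

def check_convergence(n_particles, halo_num):
    # Raise rather than silently returning bad parameters.
    if n_particles == 0:
        raise NoConvergenceError(halo_num)
```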

>
> updated script:
> http://paste.enzotools.org/show/1799/
> So the ones I have trouble with are #6, 10, 11, all involving the phrase "in
> place".  I had to wiki the phrase because I'm not familiar with it, and I
> think you mean "input is overwritten by output" as per
> http://en.wikipedia.org/wiki/In-place_algorithm

Ah, in-place being something like:

a += 1

But more than that, if you want to do it extra-quickly, you can do
operations like:

na.add(a, 1, a)

where you specify input, operand, output.  But don't worry about it.
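For completeness, a tiny sketch of the two equivalent spellings; the
third argument to na.add is the output array, so no temporary is
allocated:

```python
import numpy as na

a = na.arange(5.0)

# In-place increment, spelled two equivalent ways:
a += 1.0             # augmented assignment
na.add(a, 1.0, a)    # input, operand, output -- writes into a directly
```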

> When I solve for the 3 different cases of boundary conditions, I had to
> double (1/3 axes * 3 cases = 1 extra) the particle position memory, because
> I only check 1 axis at a time, so it's not as bad as if I did all 3 axes and
> 3 cases at once.  I can paste the code bit of "cases" into the next line
> where it is used twice by na.choose().  That should make the code "in
> place", but it would be a lot less readable.  And since "cases" is used
> twice, having it in memory also makes things easier.  I just wanted to
> clarify why it is left the way it is, but if you still think the in-place
> method is better, I can make the change easily.
> Another point which I didn't understand was the difference between using
> arrays of arrays compared to lists of arrays.  From what I read at
> http://stackoverflow.com/questions/176011/python-list-vs-array-when-to-use
> people seem to prefer lists over arrays, and use arrays only when a single
> data type is absolutely required.  Another point is that arrays are more
> memory efficient, because lists use extra bytes in case we want to append
> things to them, but if we want to do math or modify things, it is better to
> use lists.  In my case I multiply the rotation matrix by the xyz positions,
> so I should use lists because I'm doing math.  Is that the correct
> understanding?

You are correct, yes, lists in general are actually less memory
efficient.  For instance, we would never want our entire particle
arrays in lists.  But the difference that I was trying to get at is
that lists can operate with pointers -- arrays cannot.  If you create
a new array out of old arrays, they will be copied into a new location
in memory; with lists, only pointers to those objects are used.  Does
that make sense?

Anyway, thanks very much -- for your work, and your research on the
topic.  :)  Issue a pull request when you're ready; this will be a
great addition.

-Matt

> From
> G.S.
> ---------- Forwarded message ----------
> From: Matthew Turk <matthewturk at gmail.com>
> Date: Tue, Sep 6, 2011 at 4:11 AM
> Subject: Re: [Yt-dev] ellipsoid part 2
> To: yt-dev at lists.spacepope.org
>
>
> Hi Geoffrey,
>
> Great to hear from you again on this project!  I have several comments
> about the code interspersed; they are all relatively minor, but there
> are several.
>
> On Mon, Sep 5, 2011 at 10:23 PM, Geoffrey So <gsiisg at gmail.com> wrote:
>> Hello Stephen,
>> This is part 2 of the ellipsoid parameter finding.  The method is outlined
>> mainly in the Dubinski and Carlberg 1991 paper.  But since we are doing
>> this
>> in YT and we already know the particle list to begin with, I put in the
>> option that the user can specify if they want all the particles in the
>> region to be taken into account or only the particles found by the halo
>> finder algorithm. This is an option that I think should be amended to the
>> first ellipsoid parameter finder.
>
> Is this the parameter halo_only?  I would like to note that
> creation_time < 0 is not reliable, and will make your code
> Enzo-specific.  I recommend for now using the boolean selector IsStar:
>
> dm_index = dd["IsStar"]
>
> Additionally, I would recommend not constructing a new array (which
> will be slower than allowing the array selections to be left as is)
> and to instead set ax (which should be renamed, as 'ax' is
> idiomatically used in yt to reference an axis) to a list.
>
> I would go a bit further and suggest you not use the parameter
> halo_only as it is implicit in the halo finding.  If the halo finding
> subselected on dark matter, then you would want that in the ellipsoid.
>  If it didn't subselect, then you probably also don't want to
> subselect in the ellipsoid finding.
>
>> The script form is linked here:
>> http://paste.enzotools.org/show/1782/
>> Once I have written out the methodology for myself, I'll work on the YT
>> documentation, will probably need some help in this area when it comes to
>> it.
>
> I have a few comments:
>
> 1) Use booleans for parameters such as use_halos.  True and False, not 0 and
> 1.
> 2) Iterate over the halos:
>
> for halo in halos:
> 3) Index variables should be clearly index variables; "halo" is an
> object, but you use it as an index.
> 4) cm => com
> 5) You set ax to be an array twice, which will double the memory
> requirements.  I don't think ax needs to be an array at all.
> 6) Line 109, just modify ax in place.  Fancy-indexing in numpy will (I
> believe) implicitly make a copy when you modify:
>
> In [1]: import numpy as na
> In [2]: arr = na.random.random(100)
> In [3]: ai = arr > 0.8
> In [4]: bb = arr[ai]
> In [5]: bb -= 1.0
> In [6]: arr.max()
> Out[6]: 0.99642788644174751
> In [7]: bb.max()
> Out[7]: -0.0035721135582524877
>
> 7) Line 113, use range or xrange, and use singular "axis" not "axes"
> 8) Line 114, you double the memory by making a new array again.
> 9) Line 117, abs is a numpy function.
> 10) Line 121, this makes four new arrays, each of which is the length
> of the number of particles.  Construct a temporary array and do
> in-place math.  You don't need "where" here; use booleans.
> 11) Your calculation of the inertial tensor is nice, but I would again
> recommend that you avoid creating four temporary arrays when
> calculating the magnitude of 'ax'.
> 12) I would encourage you to investigate the eigenvector code if you
> suspect it gives LH system coordinates.
> 13) "for iter in range(0, maxiterations)" should just be for iter in
> xrange(maxiterations).  There is no need to specify 0.
> 14) Please use print without parentheses.  Python 3 has transformed
> print into a function, but the formatting rules are also different.
> 15) Avoid using where!
> 16) use .size instead of len() for arrays
> 17) Feel free to throw an error.  Define a subclass of Exception and
> then raise an instance of it.
>
> One fun thing would also be to plot projections of "Ones" using these
> ellipsoids as sources.  That would help us see them better, too!
>
>> Doron Lemez has used this method in his paper and helped me tested the
>> accuracy of this script.  I was wondering do I need to ask him if he wants
>> his name in the credits, or just mention his name as a special thanks?
>
> Which credits do you mean?  The CREDITS file is reserved for code
> contributions.  Feel free in your documentation to mention that he
> assisted you, if you feel it is warranted.  Please be sure that your
> documentation includes the ".. sectionauthor" directive to indicate
> that you wrote it.
>
> Please issue a pull request once you have included this routine into
> the halo finding module; we can directly review it then.
>
> Additionally, for the documentation modifications, I would encourage
> you to fork yt-doc and make your changes in source/analysis_modules/
> under the halo finding section.  Include a few (small file size)
> images, too.
>
> Thanks, GS!
>
> -Matt
>
>>  One
>> thing he did in the paper that was especially useful was defining the
>> starting sphere to have 2x the virial radius.  Without that specification,
>> I run into non-convergence sometimes, where after many iterations I end
>> up with no particles.  And I want to mention again that there's no
>> guarantee that any or all the haloes will converge with this method.
>> I tried to adhere to the PEP8 formatting, but old habits die hard so I'm
>> sure I made mistakes in the formatting, let me know if you see them.
>> From
>> G.S.
>> _______________________________________________
>> Yt-dev mailing list
>> Yt-dev at lists.spacepope.org
>> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
>>
>>


