[Yt-dev] Parallelism, or, how I learned to stop worrying and love open source development

Sat Aug 21 09:38:00 PDT 2010

Hi Brian,

I'm sure everyone's sick of my replies, but I owe you one on this.

> I agree strongly with Matt's point that the level of parallelism of various
> functions would be useful for the end user (case in point: we were startled
> to learn that we're probably wasting our time doing slices in parallel).

I've added a ticket for adding a list of effective parallel tasks.

(As a quick note to hammer this home, I would never, ever encourage
someone to slice in parallel -- yt is exclusively load on demand.
Slicing the 512^3 L7 is feasible in serial on a laptop.  Slices in
parallel are useful for when the data is already distributed, i.e.,
inline.)
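
As a rough sketch of what such a list might look like in code, using the
four categories Brian suggests below: the task names and the category
assigned to each are illustrative placeholders drawn loosely from this
thread, not measured scaling results, and PARALLEL_GUIDE and advice() are
hypothetical names, not anything that exists in yt.

```python
# Hypothetical lookup table of task -> (category, footnote).
# The entries are placeholders for illustration, not benchmarks.
SCALES_WELL = "scales embarrassingly well"
SCALES_SOMEWHAT = "scales somewhat - use carefully"
SERIAL_PREFERRED = "should be used in serial"
SERIAL_ONLY = "must be used in serial"

PARALLEL_GUIDE = {
    "slice": (
        SERIAL_PREFERRED,
        "load on demand; parallel only helps when data is already "
        "distributed (inline)",
    ),
    "projection": (
        SCALES_SOMEWHAT,
        "for very large simulations, consider fixed resolution buffers",
    ),
    "parallel HOP": (
        SCALES_SOMEWHAT,
        "watch memory usage per task",
    ),
}


def advice(task):
    """Return a one-line recommendation for a task name."""
    if task not in PARALLEL_GUIDE:
        return "%s: not yet classified - please ask on the list" % task
    category, note = PARALLEL_GUIDE[task]
    return "%s: %s (%s)" % (task, category, note)


if __name__ == "__main__":
    for name in sorted(PARALLEL_GUIDE):
        print(advice(name))
```

Keeping the footnote next to the category means the same table could feed
both the docs and a runtime warning, if that ever seems worthwhile.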

> A small table of function vs. parallelism would be great:  "scales
> embarrassingly well", "scales somewhat - use carefully", "should be used in
> serial", and "must be used in serial" would be useful.  This is probably an
> oversimplification, but some footnotes would help:  "If you want to do a
> projection through a very large simulation, use fixed resolution buffers."
> A similar estimate of memory usage for problems would be very handy, at
> least for the largest calculations.  Furthermore, a couple of example batch
> scripts would go a long way - a Kraken batch script for Parallel HOP made
> its way to Brian Crosby and me, and we found that very informative.

We should provide more examples of usages on TeraGrid resources.
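
On the memory estimate, here is a back-of-envelope sketch using the
numbers Stephen reports further down (a 2048^3 run on 264 cores averaging
about 8.5 GB per task).  It assumes one task per core and that memory
scales linearly with particle count; both are assumptions for
illustration, not measured properties of the halo finder, and the function
names are hypothetical.

```python
# Back-of-envelope memory estimate.  Assumes memory scales linearly with
# particle count and one task per core (assumptions, not measurements).


def total_memory_gb(tasks, avg_gb_per_task):
    """Aggregate memory across all tasks of a reference run."""
    return tasks * avg_gb_per_task


def scaled_per_task_gb(ref_total_gb, ref_particles, new_particles, new_tasks):
    """Scale a reference total to a new problem size and task count."""
    new_total = ref_total_gb * (float(new_particles) / ref_particles)
    return new_total / new_tasks


if __name__ == "__main__":
    # Reference run from this thread: 2048^3 on 264 tasks, ~8.5 GB each.
    ref_total = total_memory_gb(264, 8.5)
    # Target: 1024^3 (1/8 the particles) spread over 512 tasks.
    per_task = scaled_per_task_gb(ref_total, 2048 ** 3, 1024 ** 3, 512)
    print("reference total: %.0f GB" % ref_total)
    print("estimated average per task: %.2f GB" % per_task)
```

This estimates the average only; the reference run's peak task (10 GB) was
noticeably above its average, so any real request should leave headroom.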

I have been stewing on the idea of a 'yt kit.'  I'll see if I can put
that more clearly into words/code sometime soon.

> If this is done, making a practice of posting a recap to the list
> after a bug is solved would be useful, since Matt assures me that the lists
> are archived.

They are!  :)

> Regarding expectations vs. obligations, I think that it is appropriate for
> developers to fix bugs in features they create, and to give users some idea
> of the resources that those features require.  On the other side of the
> coin, non-developer users (like myself) are obligated to give feedback on
> features: what's useful?  what's not?  what appears to be broken?  My
> observation has been that a few of the yt features could become vastly more
> useful to people other than their developer if a few widgets were added.
> This is generally not trivial, but is far easier for the original developer
> than for somebody who is new to yt and python in general, and a given yt
> feature in particular.  Documentation and examples are also key - I
> personally have found the cookbook to be invaluable!

I'm glad the cookbook has helped; I'm hopeful that we can improve it a
bit, as well.  Oliver and I have chatted about analysis modules, which
ties into the 'yt kit' and which I am going to explore a bit more
later.

As for problems with things, I'm now encouraging that these get
recorded in bug reports.  I sympathize with what you say about things
that are *almost* useful.  :)
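
A minimal sketch of the excepthook idea quoted below, assuming the hook
only appends a hint and leaves the normal traceback untouched.  The ticket
URL is the one given later in this thread; reporting_excepthook,
format_bug_hint, and install are hypothetical names, not existing yt
functions.

```python
import sys
import traceback

# URL taken from the bug-reporting instructions quoted in this thread.
NEW_TICKET_URL = "http://yt.enzotools.org/newticket"


def format_bug_hint(exc_type, exc_value):
    """Build the message appended after an uncaught exception."""
    return (
        "\nIf this looks like a bug in yt, please consider reporting it:\n"
        "    %s\n(%s: %s)\n" % (NEW_TICKET_URL, exc_type.__name__, exc_value)
    )


def reporting_excepthook(exc_type, exc_value, exc_tb):
    # Print the usual traceback first so nothing is hidden from the user.
    traceback.print_exception(exc_type, exc_value, exc_tb)
    # Skip the hint when the user simply interrupted the script.
    if not issubclass(exc_type, KeyboardInterrupt):
        sys.stderr.write(format_bug_hint(exc_type, exc_value))


def install():
    """Install the hook and return the previous one so it can be restored."""
    previous = sys.excepthook
    sys.excepthook = reporting_excepthook
    return previous
```

Returning the previous hook from install() keeps the change reversible,
which matters if the user or another library has set a hook of their own.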

Thanks very much for bringing this all out into the light, and I hope
that we can take this as a starting point for improving the code, the
docs, and our community process.

Best,

Matt

>
> Anyway, that's my $0.02 of coffee-fueled ramblings.
>
> --Brian
>
> On Fri, Aug 20, 2010 at 3:04 PM, Britton Smith <brittonsmith at gmail.com>
> wrote:
>>
>> Hi everyone,
>>
>> I would like to chime in on some of the issues Matt has raised.  These are
>> very important things to think about, which is why I stayed up all night to
>> read the whole email.
>>
>> I will mostly stay out of the parallelism issue, but I'll only add that I
>> have been doing projections of 1024^3 unigrid data on kraken with 64 cores.
>> They have gone fine for me, taking roughly 10 seconds or so each.  I also
>> think an explicit list of actions that do not run in parallel is a really
>> good idea.
>>
>> On the bugs issue, it is not clear to me how a new user can tell the
>> difference between a bug and simply doing something wrong.  Either way, I
>> think that the users list is succeeding in getting people's issues solved,
>> provided that the issues make it there.  I think we really need to encourage
>> all users to be taking all potential issues through the list first, even if
>> the resolution eventually takes place off-list.  Even if the feature author
>> is doing all of the talking, this makes any new knowledge public, and allows
>> other people to help out if they can.  Most people already do this, but I
>> would suggest again that we ask that any requests for help we receive
>> directly be resent to the user list.
>>
>> On the community, there clearly needs to be a balance between what users
>> expect to get from the code and what the developers are obligated to
>> provide.  With exceptions, a vast majority of those contributing code are
>> not doing so in their spare time.  Much of this code is related to their own
>> work, but a nonzero amount is stuff that simply needs to get done.  Users of
>> the code need to recognize this.  However, at the same time, we as
>> developers need to hold ourselves to some standards, namely, if we say the
>> code does something, it better do it.  Clearly, there are situations where
>> we can not deliver on this, at least not right away.  In general though, we
>> need to be clear about what the code will and will not do, and see that our
>> statements are and remain true.
>>
>> I think we should consider setting up a wish list, where users can submit
>> ideas for features that they would like to see added to the code.  This should
>> be viewable by everyone.  I think this might help people stay conscious of
>> the fact that if they want something, that means someone has to physically
>> go and do it for them.  Maybe this will even give people the notion that
>> they can do it on their own.
>>
>> Britton
>>
>> On Thu, Aug 19, 2010 at 11:54 PM, Matthew Turk <matthewturk at gmail.com>
>> wrote:
>>>
>>> Hi all,
>>>
>>> I'm going to top post, which I guess I do more than I ought to anyway,
>>> because I'm going to try to address a number of issues that have been
>>> brought up.  I've spent some of the day thinking about this issue, and
>>> what it says about yt as a community and about my level of involvement
>>> in various areas.
>>>
>>> So, I'll touch on those at the end, but first I'll hit back on the
>>> issue of parallelism and how to address it.
>>>
>>> = Parallelism =
>>>
>>> I think what is becoming clear is that the step from serial to
>>> parallel, in terms of user experience, should be better handled
>>> than it currently is.  As it stands, the section in the manual that
>>> covers parallelism basically says, "These things work, go ahead and
>>> give it a go!"  This is my fault, and it's not really sufficient.
>>> More detail has to be given, and rather than just a whitelist of actions
>>> that are parallel safe we need to also include a *blacklist*.
>>>
>>> The second step we need to take is to provide examples of how to submit a
>>> parallel job -- how much it requires in terms of resources and so on.
>>> Unfortunately, it's not entirely clear to me the best way to organize
>>> the documentation, and I don't even really know where this would go.
>>> Stephen did a really rad job of doing this in the halo finding paper,
>>> and he's done an excellent job with his work on the halo finder as a
>>> whole.  (It's just that last 5% toward the user experience, I think.
>>> :)  My own work on the parallel projections should be better
>>> documented and the UX there should be improved as well.
>>>
>>> The third is to keep an eye on memory usage.  Memory profiling is
>>> difficult, but it's something we have tried before and that I believe
>>> needs to be re-examined.  Specifically, it seems that both projections
>>> and the parallel halo finder suffer from this problem.  As a note,
>>> next week I will be spending some time swapping out the old projection
>>> method for the new quad-tree method.  This should improve both speed
>>> and memory usage.
>>>
>>> Okay, on to the larger problems that I think this relates to.
>>>
>>> = Bugs =
>>>
>>> First off, we need a mechanism for handling bugs.  I don't want to
>>> use the word "triage" here, but it is becoming clear that we need a
>>> mechanism.  Currently, we have a Trac site that really doesn't get
>>> used at all.  I've explored a couple mechanisms for encouraging bug
>>> reports.
>>>
>>>  * I can enable OpenID login -- this means using something like your
>>> GoogleName to log in and report a bug.
>>>  * I've already replicated the .htpasswd between mercurial and the
>>> Trac site, so anyone who has a report there can log in to the Trac
>>> site.
>>>  * yt could register a default excepthook that encourages the user to
>>> report a bug.  I'm leery of this because I'm not sure I want to muck
>>> about with Python internals that much, but it could be done nicely, I
>>> think.
>>>
>>> Overall, though, what really needs to happen is some kind of *buy-in*
>>> on the part of the user -- which in this case is anyone who has had
>>> trouble with yt.  I have pulled back from yt-users, and I'm really
>>> happy that everyone else has stepped up.  But I'm worried that as time
>>> goes on, people will pick up knowledge in ways that aren't indexable
>>> by search engines and then this knowledge keeps getting re-learned.
>>>
>>> Public reporting of bugs, particularly as it could relate to
>>> improvements in documentation, is essential.  But this can't happen if
>>> it's just driven by one or two people.  And if no one else is
>>> motivated to encourage this, then perhaps that's just where we'll
>>> stay.  I can't force buy-in, I can only encourage people to see the
>>> benefits to reporting bugs, sharing experiences, and all of that.  We
>>> need to have people to read and handle bugs, and then people to whom
>>> they apply.  I really would like for this not always to be me.
>>>
>>> Anyway, if you have an hg account, you can log in:
>>>
>>> http://yt.enzotools.org/login/
>>>
>>> and then report a bug:
>>>
>>> http://yt.enzotools.org/newticket
>>>
>>> It helps if you paste the traceback with --paste on the command line
>>> of your script.
>>>
>>> = Fixing Bugs =
>>>
>>> We've had great success with people taking ownership of different bugs
>>> on the mailing list and fixing them.  This is a huge success story,
>>> and I thank all the developers that have made this happen.  But I
>>> think it's important that we continue to develop this sense of
>>> ownership through the Trac site.
>>>
>>> = Major Enhancements =
>>>
>>> Adding on major enhancements is unfortunately an open problem.  For
>>> instance, I would really like to see the parallelism framework
>>> essentially rewritten to be more modular and to take advantage of
>>> nested MPI communicators.  I have a sketch of how this would go, and
>>> I've even written some code.  But, I'm not employed to work on yt.  I
>>> mostly develop it either as it suits my research interests (and I am
>>> operating under the working assumption that this is true for everyone
>>> else) or as I find it something fun to do in the evenings.  I want it
>>> to be used, and to be useful, and I believe that my stewardship of the
>>> project up to this point supports this conclusion.
>>>
>>> I *truly* do believe in cross-code simulation analysis, sharing
>>> facilities with other users, and reproducible research.  But I am
>>> reaching the limits of what I, alone, can do.  So far we've had some
>>> pretty major contributions from a number of developers, but I think
>>> it's important that we communicate to the community that this is still
>>> a volunteer project.
>>>
>>> We don't have a team of dedicated software developers, we have a
>>> handful of scientists who are working to both further their own
>>> research interests while providing the best user experience possible
>>> for an advanced analysis code.  And, to be perfectly frank, I think
>>> we're doing a pretty darn good job on both of those fronts.  Many
>>> people now use yt on a daily basis to analyze simulation outputs from
>>> several different codes.  We've got advanced analysis and viz
>>> functionality, thanks to *you* developers, that has been published in
>>> a dozen papers, been shown at the Adler Planetarium, taken home the
>>> third place at the SciDAC visualization "Oscars," and even (ever so
>>> briefly) been on the Discovery channel.
>>>
>>> But, still, we have to keep our eye on the prize.  And if the prize
>>> the other developers have *their* eye on isn't the prize *you* have
>>> your eye on, unfortunately some responsibility will fall back on to
>>> your shoulders.  I honestly wish I could spend more time helping
>>> others use yt, developing yt, and building it to be the tool I really
>>> wish it would be.  Don't think that I don't see all the warts and
>>> problems that you all see -- I do.  In the docs, the source code, the
>>> functionality, the user experience ... I see the warts too.
>>>
>>> But even though developing yt is fun, I'm still developing it because
>>> I'm a scientist who wants to ask questions of his data.
>>>
>>> = Building Community =
>>>
>>> We've done a good job of this, but it's becoming clear that there's a
>>> disconnect:
>>>
>>>  * We're doing a mediocre job of shepherding users into being
>>> contributing developers.  I'd like to help fix this by writing up more
>>> suggestions on how to develop and share your changes.  yt will
>>> stagnate if we don't continue to churn the developer list.
>>>  * We need to articulate the vision for yt, and I'm not sure my vision
>>> is the one anyone else has.
>>>
>>> I'd love to hear suggestions about this aspect.
>>>
>>> = Documentation =
>>>
>>> Any help anyone can give with documentation would be great.
>>> Organization, notes, suggestions, anything.  Report it as a bug.
>>> Commit changes.  Email the list.
>>>
>>> ==
>>>
>>> Anyway, that's basically what I've been thinking about, and what I
>>> wanted to say.  I think we have an opportunity with yt to build a real
>>> community of collaboration and sharing of resources.  And we've done a
>>> great job with that so far.  But it still has to be something of a
>>> jumpstart approach -- jumpstarting development and then encouraging
>>> others to pick up the torch and run with it.  Grass roots,
>>> science-driven development is kind of the name of the game here.
>>>
>>> And when there *are* problems, I'm sure that lots of people are eager
>>> to jump at helping you fix them.  But we have to hear about 'em before
>>> we can.  :)
>>>
>>> Thanks,
>>>
>>> Matt
>>>
>>> On Thu, Aug 19, 2010 at 1:12 PM, Stephen Skory <stephenskory at yahoo.com>
>>> wrote:
>>> > Hi Brian & Eric,
>>> >
>>> >> As you know (since we discussed it off-list), I'm the reason for this
>>> >> being mentioned to you.  I had some pretty horrible problems with the
>>> >> various incarnations of HOP in yt being excruciatingly slow and
>>> >> consuming huge amounts of memory for a 1024^3 unigrid dataset, to the
>>> >> point where my grad student and I ended up just using P-GroupFinder,
>>> >> the standalone halo finder that comes with week-of-code enzo.  Note
>>> >> that when I say "excruciatingly slow" and "consuming huge amounts of
>>> >> memory", I mean that when we used 256 nodes on Ranger, with 2
>>> >> cores/node (so 512 cores total) for the 1024^3 dataset, it still ran
>>> >> Ranger out of memory, or, alternately, didn't finish in 24 hours.
>>> >
>>> > A few notes in response:
>>> >
>>> > - Recently I ran a 2048^3 dataset on 264 cores that took about 2 hours
>>> > and averaged about 8.5 GB per task, with a peak task of 10 GB. Your job
>>> > is 1/8 the size and should have run, and I don't know why it didn't.
>>> >
>>> > - If I wasn't trying to graduate I would have had more time to assist
>>> > when your student (Brian) asked me for help. I'm sorry so much of your
>>> > time was wasted.
>>> >
>>> > - My tool as a public tool is not any good unless other people can use
>>> > it too. Clearly I need to do some work on that.
>>> >
>>> > - It *does* use much more memory than it needs to, you are right. I know
>>> > where the problems are, and whoo-boy they are there, but they are not
>>> > easy to fix.
>>> >
>>> > - Speed could be better, but some of this has to do with how HOP itself
>>> > works. For example, it needs to run the kD tree twice, unlike FOF, which
>>> > needs to only once. The final group building step is a "global"
>>> > operation, so that's slow as well. On 128^3 particles, (normal) HOP takes
>>> > about 75 seconds, and FOF about 25. The C HOP and FOF in yt both use the
>>> > same kD tree, same data I/O methods, so that's a fair ratio of the
>>> > increased workload.
>>> >
>>> >
>>> >  _______________________________________________________
>>> > sskory at physics.ucsd.edu           o__  Stephen Skory
>>> > http://physics.ucsd.edu/~sskory/ _.>/ _Graduate Student
>>> > ________________________________(_)_\(_)_______________
>>> >
>>> > _______________________________________________
>>> > Yt-dev mailing list
>>> > Yt-dev at lists.spacepope.org
>>> > http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
>>> >