[Yt-dev] Parallelism, or, how I learned to stop worrying and love open source development

Sat Aug 21 00:28:01 PDT 2010

Hi Britton,

Thanks for your thoughtful reply.  I'm going to address some of the
technical aspects.

> I will mostly stay out of the parallelism issue, but I'll only add that I
> have been doing projections of 1024^3 unigrid data on kraken with 64 cores.
> They have gone fine for me, taking roughly 10 seconds or so each.  I also
> think an explicit list of actions that do not run in parallel is a really
> good idea.

I've assigned a ticket about the resource requirements and I'll handle
the "blacklist" of parallel actions.
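
To make the blacklist idea concrete, here is a rough sketch of how
serial-only actions could be flagged in code.  Everything named here
(PARALLEL_BLACKLIST, serial_only, NotParallelSafeError) is hypothetical
and not part of yt; the task count is passed in as a callable so that
the sketch does not depend on MPI being installed.

```python
import functools

# Hypothetical registry: names of actions known NOT to be parallel-safe.
PARALLEL_BLACKLIST = set()


class NotParallelSafeError(RuntimeError):
    """Raised when a blacklisted action is called with more than one task."""


def serial_only(get_task_count):
    """Mark a function as serial-only and record it in the blacklist.

    get_task_count is any zero-argument callable returning the number of
    parallel tasks (with mpi4py this could be MPI.COMM_WORLD.Get_size).
    """
    def decorator(func):
        PARALLEL_BLACKLIST.add(func.__name__)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            n_tasks = get_task_count()
            if n_tasks > 1:
                # Fail loudly instead of silently giving a wrong answer.
                raise NotParallelSafeError(
                    "%s is not parallel-safe (running on %d tasks)"
                    % (func.__name__, n_tasks))
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

The same registry could be dumped straight into the docs, so the
written blacklist and the enforced one could not drift apart.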

For the issue of projection speed, I've never had trouble, but I also
mainly use yt for projecting non-unigrid data.  I believe yt scales
well to unigrids, but it is the fixed resolution projections (which
are poorly documented) that apply a number of shortcuts that would
help with unigrid projections.

For AMR projections I would put yt up against any other game in town.
As of the end of next week, when I finally have time to swap in the
QuadTree projections I wrote six months ago, we should get an order of
magnitude speedup.
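
For anyone who has not seen the idea, here is a toy sketch of a
quadtree projection.  It is an illustration only, not the yt
implementation, and it makes simplifying assumptions: a square image
plane, refinement by a factor of two, and cells already reduced to a
(level, i, j, value times path length) contribution along the line of
sight.  The names QuadTreeNode, add_cell, and flatten are made up for
this example.

```python
class QuadTreeNode(object):
    """One node of the image-plane quadtree."""

    def __init__(self):
        self.value = 0.0      # sum of contributions deposited at this level
        self.children = None  # None for a leaf, otherwise four child nodes


def add_cell(root, level, i, j, contribution):
    """Deposit one cell's contribution (value times path length).

    Level 0 is the root; level L has 2**L cells per side, indexed (i, j).
    """
    node = root
    for bit in range(level - 1, -1, -1):
        if node.children is None:
            node.children = [QuadTreeNode() for _ in range(4)]
        # Walk from the most significant index bit down to the least.
        bi = (i >> bit) & 1
        bj = (j >> bit) & 1
        node = node.children[2 * bi + bj]
    node.value += contribution


def flatten(node, level=0, i=0, j=0, inherited=0.0):
    """Yield (level, i, j, total) for every leaf.

    Each leaf's total is its own value plus everything deposited in the
    coarser nodes above it.
    """
    total = inherited + node.value
    if node.children is None:
        yield (level, i, j, total)
        return
    for k, child in enumerate(node.children):
        for leaf in flatten(child, level + 1,
                            2 * i + (k >> 1), 2 * j + (k & 1), total):
            yield leaf


root = QuadTreeNode()
add_cell(root, 0, 0, 0, 1.0)   # one coarse cell covering the whole image
add_cell(root, 1, 1, 0, 0.5)   # one finer cell in front of part of it
image = list(flatten(root))
```

Coarse and fine contributions are only combined once, at read-out,
which is where this kind of structure can save time and memory
compared with repeatedly merging per-level arrays.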

> On the bugs issue, it is not clear to me how a new user can tell the
> difference between a bug and simply doing something wrong.  Either way, I
> think that the users list is succeeding in getting people's issues solved,
> provided that the issues make it there.  I think we really need to encourage
> all users to be taking all potential issues through the list first, even if
> the resolution eventually takes place off-list.  Even if the feature author
> is doing all of the talking, this makes any new knowledge public, and allows
> other people to help out if they can.  Most people already do this, but I
> would suggest again that we ask that any requests for help we receive
> directly be resent to the user list.

I agree with this.  I think we should start encouraging *very*
strongly that all questions and problems with the software be
discussed on the list.  I'm a leaky spot in the pipeline, I do
confess, but I'll try my best to redirect questions I get off list
back to on-list.

For what it's worth (and I'll address this in my reply to Jeff) I've
taken the time to clean up and add some niceties to Trac.  This
includes OpenID authentication and a link for reporting bugs.  In the
other email I'll expand a bit.

> On the community, there clearly needs to be a balance between what users
> expect to get from the code and what the developers are obligated to
> provide.  With exceptions, a vast majority of those contributing code are
> not doing so in their spare time.  Much of this code is related to their own
> work, but a nonzero amount is stuff that simply needs to get done.  Users of
> the code need to recognize this.  However, at the same time, we as
> developers need to hold ourselves to some standards, namely, if we say the
> code does something, it better do it.  Clearly, there are situations where
> we can not deliver on this, at least not right away.  In general though, we
> need to be clear about what the code will and will not do, and see that our
> statements are and remain true.

I agree with this.  It's a tough balance, between doing what you need
to do and trying to share it, but also making sure it is useful to
other people.  I think you've summed that up nicely.

Recently I added a level-of-support grid of features versus astro
codes.  Maybe we should consider adding coarse features and a
level-of-reliability estimate as well.

> I think we should consider setting up a wish list, where users can submit
> ideas for features that they would like to see added to the code.  This should
> be viewable by everyone.  I think this might help people stay conscious of
> the fact that if they want something, that means someone has to physically
> go and do it for them.  Maybe this will even give people the notion that
> they can do it on their own.

That would be excellent.  I've set up a new page:

http://yt.enzotools.org/wiki/WishList

that includes my previous list of project ideas from the
GettingInvolved page and adds a query to find all the open tickets.
Anyone who authenticates with an OpenID (this may be a mistake ...) can
edit the page and add new wishlist items.  However, I've also mirrored
the hg passwords, and I'd prefer that developers log in with those.

Again, thanks for your thoughtful reply.

-Matt

>
> Britton
>
> On Thu, Aug 19, 2010 at 11:54 PM, Matthew Turk <matthewturk at gmail.com>
> wrote:
>>
>> Hi all,
>>
>> I'm going to top post, which I guess I do more than I ought to anyway,
>> because I'm going to try to address a number of issues that have been
>> brought up.  I've spent some of the day thinking about this issue, and
>> what it says about yt as a community and about my level of involvement
>> in various areas.
>>
>> So, I'll touch on those at the end, but first I'll hit back on the
>> issue of parallelism and how to address it.
>>
>> = Parallelism =
>>
>> I think what is becoming clear is that the step from serial to
>> parallel, in terms of user experience, should be better handled
>> than it currently is.  As it stands, the section in the manual that
>> covers parallelism basically says, "These things work, go ahead and
>> give it a go!"  This is my fault, and it's not really sufficient.
>> More detail has to be given, and in addition to a whitelist of actions
>> that are parallel safe we also need to include a *blacklist*.
>>
>> The second step we need to take is to provide examples of how to submit a
>> parallel job -- how much it requires in terms of resources and so on.
>> Unfortunately, it's not entirely clear to me the best way to organize
>> the documentation, and I don't even really know where this would go.
>> Stephen did a really rad job of doing this in the halo finding paper,
>> and he's done an excellent job with his work on the halo finder as a
>> whole.  (It's just that last 5% toward the user experience, I think.
>> :)  My own work on the parallel projections should be better
>> documented and the UX there should be improved as well.
>>
>> The third is to keep an eye on memory usage.  Memory profiling is
>> difficult, but it's something we have tried before and that I believe
>> needs to be re-examined.  Specifically, it seems that both projections
>> and the parallel halo finder suffer from this problem.  As a note,
>> next week I will be spending some time swapping out the old projection
>> method for the new quad-tree method.  This should improve both speed
>> and memory usage.
>>
>> Okay, on to the larger problems that I think this relates to.
>>
>> = Bugs =
>>
>> First off, we need a mechanism for handling bugs.  I don't want to
>> use the word "triage" here, but it is becoming clear that we need a
>> mechanism.  Currently, we have a Trac site that really doesn't get
>> used at all.  I've explored a couple mechanisms for encouraging bug
>> reports.
>>
>>  * I can enable OpenID login -- this means using something like your
>> GoogleName to log in and report a bug.
>>  * I've already replicated the .htpasswd between mercurial and the
>> Trac site, so anyone who has an account there can log in to the Trac
>> site.
>>  * yt could register a default excepthook that encourages the user to
>> report a bug.  I'm leery of this because I'm not sure I want to muck
>> about with Python internals that much, but it could be done nicely, I
>> think.
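
A rough sketch of what such a hook could look like, using the standard
sys.excepthook mechanism.  The function names are hypothetical, and the
hook is only defined here, not installed, so nothing changes until
install_bug_report_hook() is called explicitly.

```python
import sys

# Ticket URL as given elsewhere in this thread.
BUG_REPORT_URL = "http://yt.enzotools.org/newticket"


def make_bug_report_hook(previous_hook, stream=None):
    """Return an excepthook that calls previous_hook, then prints a hint."""
    def hook(exc_type, exc_value, exc_tb):
        # Always show the normal traceback first.
        previous_hook(exc_type, exc_value, exc_tb)
        # Ctrl-C is not a bug, so stay quiet for it.
        if issubclass(exc_type, KeyboardInterrupt):
            return
        out = stream if stream is not None else sys.stderr
        out.write("\nIf you think this is a bug in yt, please report it at\n"
                  "  %s\nand include the traceback above.\n" % BUG_REPORT_URL)
    return hook


def install_bug_report_hook():
    """Chain the hint onto whatever hook is currently installed."""
    sys.excepthook = make_bug_report_hook(sys.excepthook)
```

Chaining onto the existing hook, rather than replacing it, keeps the
mucking about with Python internals to a minimum: the traceback is
printed exactly as before and only one extra message is appended.
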
>>
>> Overall, though, what really needs to happen is some kind of *buy-in*
>> on the part of the user -- which in this case is anyone who has had
>> trouble with yt.  I have pulled back from yt-users, and I'm really
>> happy that everyone else has stepped up.  But I'm worried that as time
>> goes on, people will pick up knowledge in ways that aren't indexable
>> by search engines and then this knowledge keeps getting re-learned.
>>
>> Public reporting of bugs, particularly as it could relate to
>> improvements in documentation, is essential.  But this can't happen if
>> it's just driven by one or two people.  And if no one else is
>> motivated to encourage this, then perhaps that's just where we'll
>> stay.  I can't force buy-in, I can only encourage people to see the
>> benefits to reporting bugs, sharing experiences, and all of that.  We
>> need to have people to read and handle bugs, and then people to whom
>> they apply.  I really would like for this not always to be me.
>>
>> Anyway, if you have an hg account, you can log in:
>>
>> http://yt.enzotools.org/login/
>>
>> and then report a bug:
>>
>> http://yt.enzotools.org/newticket
>>
>> It helps if you paste the traceback with --paste on the command line
>> of your script.
>>
>> = Fixing Bugs =
>>
>> We've had great success with people taking ownership of different bugs
>> on the mailing list and fixing them.  This is a huge success story,
>> and I thank all the developers that have made this happen.  But I
>> think it's important that we continue to develop this sense of
>> ownership through the Trac site.
>>
>> = Major Enhancements =
>>
>> Adding on major enhancements is unfortunately an open problem.  For
>> instance, I would really like to see the parallelism framework
>> essentially rewritten to be more modular and to take advantage of
>> nested MPI communicators.  I have a sketch of how this would go, and
>> I've even written some code.  But, I'm not employed to work on yt.  I
>> mostly develop it either as it suits my research interests (and I am
>> operating under the working assumption that this is true for everyone
>> else) or as I find it something fun to do in the evenings.  I want it
>> to be used, and to be useful, and I believe that my stewardship of the
>> project up to this point supports this conclusion.
>>
>> I *truly* do believe in cross-code simulation analysis, sharing
>> facilities with other users, and reproducible research.  But I am
>> reaching the limits of what I, alone, can do.  So far we've had some
>> pretty major contributions from a number of developers, but I think
>> it's important that we communicate to the community that this is still
>> a volunteer project.
>>
>> We don't have a team of dedicated software developers, we have a
>> handful of scientists who are working to both further their own
>> research interests while providing the best user experience possible
>> for an advanced analysis code.  And, to be perfectly frank, I think
>> we're doing a pretty darn good job on both of those fronts.  Many
>> people now use yt on a daily basis to analyze simulation outputs from
>> several different codes.  We've got advanced analysis and viz
>> functionality, thanks to *you* developers, that has been published in a
>> dozen papers, been shown at the Adler Planetarium, taken home the
>> third place at the SciDAC visualization "Oscars," and even (ever so
>> briefly) been on the Discovery channel.
>>
>> But, still, we have to keep our eye on the prize.  And if the prize
>> the other developers have *their* eye on isn't the prize *you* have
>> your eye on, unfortunately some responsibility will fall back on to
>> your shoulders.  I honestly wish I could spend more time helping
>> others use yt, developing yt, and building it to be the tool I really
>> wish it would be.  Don't think that I don't see all the warts and
>> problems that you all see -- I do.  In the docs, the source code, the
>> functionality, the user experience ... I see the warts too.
>>
>> But even though developing yt is fun, I'm still developing it because
>> I'm a scientist who wants to ask questions of his data.
>>
>> = Building Community =
>>
>> We've done a good job of this, but it's becoming clear that there's a
>> disconnect:
>>
>>  * We're doing a mediocre job of shepherding users into being
>> contributing developers.  I'd like to help fix this by writing up more
>> suggestions on how to develop and share your changes.  yt will
>> stagnate if we don't continue to churn the developer list.
>>  * We need to articulate the vision for yt, and I'm not sure my vision
>> is the one anyone else has.
>>
>> I'd love to hear suggestions about this aspect.
>>
>> = Documentation =
>>
>> Any help anyone can give with documentation would be great.
>> Organization, notes, suggestions, anything.  Report it as a bug.
>> Commit changes.  Email the list.
>>
>> ==
>>
>> Anyway, that's basically what I've been thinking about, and what I
>> wanted to say.  I think we have an opportunity with yt to build a real
>> community of collaboration and sharing of resources.  And we've done a
>> great job with that so far.  But it still has to be something of a
>> jumpstart approach -- jumpstarting development and then encouraging
>> others to pick up the torch and run with it.  Grass roots,
>> science-driven development is kind of the name of the game here.
>>
>> And when there *are* problems, I'm sure that lots of people are eager
>> to jump at helping you fix them.  But we have to hear about 'em before
>> we can.  :)
>>
>> Thanks,
>>
>> Matt
>>
>> On Thu, Aug 19, 2010 at 1:12 PM, Stephen Skory <stephenskory at yahoo.com>
>> wrote:
>> > Hi Brian & Eric,
>> >
>> >>As you know (since we discussed it off-list), I'm the reason for this being
>> >>mentioned to you.  I had some pretty horrible problems with the various
>> >>incarnations of HOP in yt being excruciatingly slow and consuming huge amounts
>> >>of memory for a 1024^3 unigrid dataset, to the point where my grad student and I
>> >>ended up just using P-GroupFinder, the standalone halo finder that comes with
>> >>week-of-code enzo.  Note that when I say "excruciatingly slow" and "consuming
>> >>huge amounts of memory", I mean that when we used 256 nodes on Ranger, with 2
>> >>cores/node (so 512 cores total) for the 1024^3 dataset, it still ran Ranger out
>> >>of memory, or, alternately, didn't finish in 24 hours.
>> >
>> > A few notes in response:
>> >
>> > - Recently I ran a 2048^3 dataset on 264 cores that took about 2 hours,
>> > which averaged about 8.5 GB per task with a peak task of 10 GB. Your job is
>> > 1/8 the size and should have run, and I don't know why it didn't.
>> >
>> > - If I wasn't trying to graduate I would have had more time to assist when
>> > your student (Brian) asked me for help. I'm sorry so much of your time was
>> > wasted.
>> >
>> > - My tool as a public tool is not any good unless other people can use it
>> > too. Clearly I need to do some work on that.
>> >
>> > - It *does* use much more memory than it needs to, you are right. I know
>> > where the problems are, and whoo-boy they are there, but they are not easy
>> > to fix.
>> >
>> > - Speed could be better, but some of this has to do with how HOP itself
>> > works. For example, it needs to run the kD tree twice, unlike FOF, which
>> > needs to run it only once. The final group building step is a "global"
>> > operation, so that's slow as well. On 128^3 particles, (normal) HOP takes
>> > about 75 seconds, and FOF about 25. The C HOP and FOF in yt both use the
>> > same kD tree and the same data I/O methods, so that's a fair ratio of the
>> > increased workload.
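
As a back-of-the-envelope check on the 1/8 figure above: the numbers
for the 2048^3 run are taken from Stephen's note, while the assumption
that memory scales linearly with particle count and spreads evenly over
tasks is a simplification added here.

```python
def estimate_memory(ref_side, ref_tasks, ref_gb_per_task, side, tasks):
    """Scale a reference run's memory use to a different problem size.

    Assumes total memory is proportional to the number of particles
    (side**3) and is shared evenly across tasks; both are simplifications.
    """
    ratio = (float(side) / ref_side) ** 3
    total_gb = ref_tasks * ref_gb_per_task * ratio
    return ratio, total_gb, total_gb / tasks


# Reference: 2048^3 on 264 tasks at about 8.5 GB per task.
# Target: 1024^3 on 512 tasks.
ratio, total_gb, per_task_gb = estimate_memory(2048, 264, 8.5, 1024, 512)
print("size ratio: %.3f" % ratio)
print("estimated total: %.1f GB" % total_gb)
print("estimated per task: %.2f GB" % per_task_gb)
```

Under those assumptions the 1024^3 job would need roughly 280 GB in
total, a bit over half a GB per task on 512 tasks, which is why the
failure is surprising.
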
>> >
>> >
>> >  _______________________________________________________
>> > sskory at physics.ucsd.edu           o__  Stephen Skory
>> > http://physics.ucsd.edu/~sskory/ _.>/ _Graduate Student
>> > ________________________________(_)_\(_)_______________
>> >
>> > _______________________________________________
>> > Yt-dev mailing list
>> > Yt-dev at lists.spacepope.org
>> > http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
>> >