[yt-dev] Debugging parallel objects?

Matthew Turk matthewturk at gmail.com
Wed Oct 3 13:53:31 PDT 2012


Hey Nathan,

On Wed, Oct 3, 2012 at 4:48 PM, Nathan Goldbaum <nathan12343 at gmail.com> wrote:
> Hi all,
>
> I have a time series script that iterates over a bunch of simulation outputs, does some analysis, and then dumps a pickle with the results of the analysis.  This script works correctly when I loop over the outputs like so:
>
> for pf in ts:
>         do stuff
>
> But when I use the piter functionality to loop over the outputs in parallel:
>
> for sto,pf in ts.piter(storage=sto):
>         do the same stuff
>
> the script will sometimes hang when I run on more than one processor.  It's difficult to find exactly where and why it's hanging since the run is distributed - I'd like to be able to reproduce the error to track down why it's happening.

My guess is that there's a problem with multi-level parallelism, or
with processors not having any work.
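
To make the "processors not having any work" case concrete, here is a
minimal sketch in plain Python.  It assumes a simple round-robin split,
which is only an illustration and not necessarily how yt's
parallel_objects hands out work; the output names and the round_robin
helper are hypothetical.

```python
def round_robin(items, nprocs):
    """Map each rank to its share of items (illustrative split, not yt's)."""
    return {rank: items[rank::nprocs] for rank in range(nprocs)}


# Hypothetical output names: three outputs spread over four processors.
outputs = ["DD0000", "DD0001", "DD0002"]
assignment = round_robin(outputs, 4)

# Ranks that received nothing never enter the loop body, so any
# collective operation inside the loop would run without them.
idle_ranks = [rank for rank, work in assignment.items() if not work]
print("idle ranks:", idle_ranks)
```

If the hang only shows up when the processor count exceeds the number
of outputs (or does not divide it evenly), that points in this
direction.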

>
> I'm curious if anyone has any tips for debugging parallel operations in yt.  I'm not very familiar with the internals of the parallel_objects machinery, so likely places to check or put breakpoints would be very helpful.  Also, are there parallel debugging tools for python?

Okay, a couple of things that ship with yt can help here; I put them
in when debugging the initial parallel code a couple of years ago.
Run with --rpdb, which will spawn an XML-RPC "remote" pdb that can
respond to most commands.  Then send the processes SIGUSR2, which I
think will raise a runtime error when yt catches it.  This puts all
the processors into the (*remarkably* dangerous) state of waiting for
pdb commands.  Now run:

yt rpdb -t WHATEVER

where WHATEVER is 0 .. NPROC-1.  This has to be run on the same host
as the process you want to pdb.  This will put you into pdb mode.  You
can issue "shutdown" which will shut down a single process, but it's
probably easier to kill the mpirun task when you've figured it out.

For a simpler way to inspect where the processes are, send them
SIGUSR1 instead.  This will cause a stack trace to print.
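
If you want the same kind of stack dump in a standalone script, the
idea can be sketched with the standard library alone.  This is not
yt's own handler, just an assumption-laden illustration of the
technique; install_stack_dump_handler is a hypothetical name, and
SIGUSR1 is only available on Unix-like systems.

```python
import os
import signal
import sys
import traceback


def install_stack_dump_handler(stream=None, signum=signal.SIGUSR1):
    """Print the interrupted Python stack whenever `signum` arrives."""

    def _dump(sig, frame):
        out = stream if stream is not None else sys.stderr
        out.write("pid %d caught signal %d\n" % (os.getpid(), sig))
        # `frame` is where the main thread was executing when interrupted.
        traceback.print_stack(frame, file=out)
        out.flush()

    signal.signal(signum, _dump)


# Install at startup; the process keeps running after each dump.
install_stack_dump_handler()
```

With this installed, "kill -USR1 PID" from another shell on the same
host prints where that process currently is.  Note that Python only
runs signal handlers between bytecodes, so a process stuck inside a
blocking C call may not print until that call returns.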

Good luck -- and if you can figure out how to repeat it simply, send
that along too.

-Matt

>
> Thanks for your help,
>
> Nathan
> _______________________________________________
> yt-dev mailing list
> yt-dev at lists.spacepope.org
> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
