[yt-users] parallel_objects with projection hanging

Nathan Goldbaum nathan12343 at gmail.com
Thu Dec 11 14:22:53 PST 2014


On Thu, Dec 11, 2014 at 2:17 PM, Semyeong Oh <semyeong.oh at gmail.com> wrote:
>
> Hello,
>
> The cause of this barrier problem seems to have nothing to do with the
> projection itself.
> It was calling the hierarchy before entering parallel_objects, which is
> still a strange thing.
> (I was calling it because I wanted to do get_smallest_dx().)
> I attach a minimal example that reproduces the problem when run with -np 2.
> If this is expected, it would be great if that were explicitly noted in the
> documentation.
>
> Semyeong
>
> from yt.config import ytcfg
> ytcfg['yt', 'loglevel'] = '1'
>
> from yt.mods import *
> import numpy as np
>
>
> cs = np.random.rand(3, 3)
>
> def do(c, pf):
>     cube = pf.h.region(c, c-0.1, c+0.1)
>     proj = pf.h.proj(2, 'Density', source=cube)
>
> def do_parallel(pf):
>     pf.h  # <---- accessing the hierarchy here triggers the hang
>     for c in parallel_objects(cs):
>         do(c, pf)
>
> if __name__ == '__main__':
>     pf = load('../../enzo_tiny_cosmology/DD0046/DD0046')
>     do_parallel(pf)
>

No, I don't think this is expected.  Can you open an issue?  Please include
the above reproduction script in the issue.

https://bitbucket.org/yt_analysis/yt/issues/new


>
>
> > On Dec 7, 2014, at 2:51 PM, John ZuHone <jzuhone at gmail.com> wrote:
> >
> > For what it's worth, this is essentially the same problem I reported the
> > other day--projections in parallel_objects hanging.
> >
> > John ZuHone
> > Kavli Center for Astrophysics and Space Research
> > Massachusetts Institute of Technology
> > 77 Massachusetts Ave., 37-582G
> > Cambridge, MA 02139
> > (w) 617-253-2354
> > (m) 781-708-5004
> > jzuhone at space.mit.edu
> > jzuhone at gmail.com
> > http://www.jzuhone.com
> >
> > On Dec 7, 2014, at 1:51 PM, Matthew Turk <matthewturk at gmail.com> wrote:
> >
> >> Hi Semyeong,
> >>
> >> Sounds like a mismatched barrier.
> >>
> >> Can you try with the enzo_tiny_cosmology dataset from the website to
> >> see if that works?  I am thinking there may be a corner case we
> >> haven't seen in the domain decomposition.  We'll get this fixed!
> >>
> >> -Matt
> >>
> >> On Sat, Dec 6, 2014 at 7:47 PM, Semyeong Oh <semyeong.oh at gmail.com> wrote:
> >>> Hi Matt,
> >>>
> >>> Counting the "Opening MPI Barrier on XX" messages in the log, I think
> >>> what happens is that somehow a barrier is opened after each projection.
> >>> As a result, the process that does the projections opens a different
> >>> number of barriers than the idle process, which only opens a barrier
> >>> due to this statement in parallel_objects:
> >>>    if barrier:
> >>>        my_communicator.barrier()
> >>> and this is why it hangs after the second projection.
> >>>
> >>> Setting barrier=False on parallel_objects doesn't work either, because
> >>> then it hangs at the barrier after the first projection.
> >>> Is this expected, and is there a workaround?
> >>> I couldn't pinpoint exactly why this happens, but I'm guessing it has
> >>> something to do with the fact that yt supports doing the actual
> >>> projection itself in parallel, not just using parallel_objects to
> >>> parallelize over multiple objects.
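> >>>
> >>> To illustrate the kind of mismatch I mean, here is a minimal sketch in
> >>> plain mpi4py, independent of yt (the script name barrier_mismatch.py is
> >>> made up): the two ranks call Barrier() a different number of times, so
> >>> the rank that opens the extra barrier blocks forever.
> >>>
> >>> # barrier_mismatch.py -- run with: mpirun -np 2 python barrier_mismatch.py
> >>> from mpi4py import MPI
> >>>
> >>> comm = MPI.COMM_WORLD
> >>> rank = comm.Get_rank()
> >>>
> >>> # Both ranks reach this barrier, so it completes.
> >>> comm.Barrier()
> >>>
> >>> if rank == 0:
> >>>     # Rank 0 opens one extra barrier (like the process that did the
> >>>     # projection); rank 1 never calls it, so rank 0 hangs here.
> >>>     comm.Barrier()
> >>>
> >>> print("rank %d finished" % rank)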
> >>>
> >>> Semyeong
> >>>
> >>>> On Dec 6, 2014, at 7:39 PM, Semyeong Oh <semyeong.oh at gmail.com> wrote:
> >>>>
> >>>> Hi Matt,
> >>>>
> >>>> My output from using 2 processors on 3 objects is here:
> >>>> https://gist.github.com/smoh/6e396a7606a3bbff3450
> >>>> I've set the loglevel to 1.
> >>>> The first two objects ran fine, printing some output:
> >>>> 500 …
> >>>> 501 …
> >>>> Then the third object starts running on rank 0, while rank 1 sleeps.
> >>>> The projection happens, but after that, at this line,
> >>>> P000 yt : [DEBUG    ] 2014-12-06 19:03:48,494 Opening MPI Barrier on 0
> >>>> the process never ends.
> >>>> The rest of the output is from sending SIGUSR1 to each process, which
> >>>> I can hardly make sense of.
> >>>> (There are two projections involved in my calculation.)
> >>>>
> >>>> Any clues?
> >>>>
> >>>> Thanks,
> >>>> Semyeong
> >>>>
> >>>>
> >>>>> On Dec 6, 2014, at 11:53 AM, Matthew Turk <matthewturk at gmail.com> wrote:
> >>>>>
> >>>>> Hi Semyeong,
> >>>>>
> >>>>> This is somewhat odd.  When you say the process hangs, do you mean
> >>>>> that the process of projection hangs, or the yt script as a whole
> >>>>> hangs?  You should be able to send SIGUSR1 to the processes to get a
> >>>>> stack trace, which may help with debugging.  Or, if you Ctrl-C, it
> >>>>> may output a stack trace, which will help you see where it's hanging.
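> >>>>>
> >>>>> As an illustrative sketch (not necessarily exactly how yt wires this
> >>>>> up), a process can install a SIGUSR1 handler that dumps its current
> >>>>> stack; then "kill -USR1 <pid>" shows where each process is stuck:
> >>>>>
> >>>>> import signal
> >>>>> import sys
> >>>>> import traceback
> >>>>>
> >>>>> def print_traceback(signum, frame):
> >>>>>     # Print the stack of the interrupted frame to stderr.
> >>>>>     traceback.print_stack(frame, file=sys.stderr)
> >>>>>
> >>>>> signal.signal(signal.SIGUSR1, print_traceback)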
> >>>>>
> >>>>> -Matt
> >>>>>
> >>>>> On Sat, Dec 6, 2014 at 5:14 AM, Semyeong Oh <semyeong.oh at gmail.com> wrote:
> >>>>>> Hi yt,
> >>>>>>
> >>>>>> I have two questions on using parallel_objects. I am using yt 2.6.
> >>>>>>
> >>>>>> 1. I have a problem with parallel_objects hanging at the end.
> >>>>>>
> >>>>>> def do(i, pf):
> >>>>>>     cube = pf.h.region(..)
> >>>>>>     proj = pf.h.proj(…., source=cube)
> >>>>>>     frb = proj.to_frb(..)
> >>>>>>     ….
> >>>>>>
> >>>>>> objects = [list of indices]
> >>>>>> pf = load(..)
> >>>>>> for i in parallel_objects(objects):
> >>>>>>     do(i, pf)
> >>>>>>
> >>>>>> and I run the script as
> >>>>>> mpirun -np Nprocs python myscript.py --parallel
> >>>>>>
> >>>>>> When I tested with a simple print operation in do instead of proj,
> >>>>>> parallel_objects seems to handle cases where Nobjects is not
> >>>>>> divisible by Nprocs just fine. But with my real script that has proj
> >>>>>> in do, it seems to hang at the end. For example, if Nobjects is 3 and
> >>>>>> Nprocs is 2, the first two objects go through without problem, and the
> >>>>>> projection of the third completes, but then the process just hangs
> >>>>>> there. Why is that?
> >>>>>>
> >>>>>> 2. Is it possible to use only a portion of the Nprocs assigned? Also,
> >>>>>> playing around with the simple print operation, it seems that because
> >>>>>> of the way parallel_objects divides the work, the work is duplicated.
> >>>>>> E.g., when I run mpirun -np 5 but have parallel_objects(objects, njobs=3):
> >>>>>> rank  i_object
> >>>>>> 0     1
> >>>>>> 1     1
> >>>>>> 2     2
> >>>>>> 3     2
> >>>>>> 4     3
> >>>>>> …
> >>>>>> so object 1 would still run simultaneously on ranks 0 and 1.
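> >>>>>>
> >>>>>> For concreteness, here is a minimal sketch of the print test I mean
> >>>>>> (the script name check_njobs.py is made up):
> >>>>>>
> >>>>>> # check_njobs.py -- run with: mpirun -np 5 python check_njobs.py --parallel
> >>>>>> from yt.mods import *  # parallel_objects comes in via yt.mods
> >>>>>> from mpi4py import MPI
> >>>>>>
> >>>>>> rank = MPI.COMM_WORLD.Get_rank()
> >>>>>> objects = [1, 2, 3]
> >>>>>> for obj in parallel_objects(objects, njobs=3):
> >>>>>>     # Each rank prints which object it was handed, so the duplication
> >>>>>>     # across ranks within a job group is visible.
> >>>>>>     print("rank %d  i_object %d" % (rank, obj))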
> >>>>>>
> >>>>>> To prevent this, would something like below work?
> >>>>>>
> >>>>>> from mpi4py import MPI
> >>>>>>
> >>>>>> size = MPI.COMM_WORLD.Get_size()
> >>>>>> rank = MPI.COMM_WORLD.Get_rank()
> >>>>>> njobs = 3
> >>>>>> for ind in parallel_objects(objects, njobs):
> >>>>>>     if rank % int(size/njobs) != 0:
> >>>>>>         continue
> >>>>>>     else:
> >>>>>>         do(ind)
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Semyeong
> >>>>>>
> >>>>>>
> >>>>>>