[yt-users] parallel_objects with projection hanging

Matthew Turk matthewturk at gmail.com
Thu Dec 11 15:24:30 PST 2014


I will address this immediately -- thank you very, very much for
tracking it down.

On Thu, Dec 11, 2014 at 4:22 PM, Nathan Goldbaum <nathan12343 at gmail.com> wrote:
>
>
> On Thu, Dec 11, 2014 at 2:17 PM, Semyeong Oh <semyeong.oh at gmail.com> wrote:
>>
>> Hello,
>>
>> The cause of this barrier problem seems to have nothing to do with the
>> projection itself.
>> It was triggered by accessing the hierarchy before entering
>> parallel_objects, which is still a strange thing.
>> (I was accessing it because I wanted to call get_smallest_dx().)
>> I attach a minimal example that reproduces the problem when run with -np 2.
>> If this is expected, it would be great if it were explicitly noted in the
>> documentation.
>>
>> Semyeong
>>
>> from yt.config import ytcfg
>> ytcfg['yt', 'loglevel'] = '1'
>>
>> from yt.mods import *
>> import numpy as np
>>
>>
>> cs = np.random.rand(3, 3)
>>
>> def do(c, pf):
>>     cube = pf.h.region(c, c-0.1, c+0.1)
>>     proj = pf.h.proj(2, 'Density', source=cube)
>>
>> def do_parallel(pf):
>>     pf.h  # <-- accessing the hierarchy here triggers the hang
>>     for c in parallel_objects(cs):
>>         do(c, pf)
>>
>> if __name__ == '__main__':
>>     pf = load('../../enzo_tiny_cosmology/DD0046/DD0046')
>>     do_parallel(pf)
>
>
> No, I don't think this is expected.  Can you open an issue?  Please include
> the above reproduction script in the issue.
>
> https://bitbucket.org/yt_analysis/yt/issues/new
>
>>
>>
>>
>> > On Dec 7, 2014, at 2:51 PM, John ZuHone <jzuhone at gmail.com> wrote:
>> >
>> > For what it's worth, this is essentially the same problem I reported the
>> > other day--projections in parallel_objects hanging.
>> >
>> > John ZuHone
>> > Kavli Center for Astrophysics and Space Research
>> > Massachusetts Institute of Technology
>> > 77 Massachusetts Ave., 37-582G
>> > Cambridge, MA 02139
>> > (w) 617-253-2354
>> > (m) 781-708-5004
>> > jzuhone at space.mit.edu
>> > jzuhone at gmail.com
>> > http://www.jzuhone.com
>> >
>> > On Dec 7, 2014, at 1:51 PM, Matthew Turk <matthewturk at gmail.com> wrote:
>> >
>> >> Hi Semyeong,
>> >>
>> >> Sounds like a mismatched barrier.
>> >>
>> >> Can you try with the enzo_tiny_cosmology dataset from the website to
>> >> see if that works?  I am thinking there may be a corner case we
>> >> haven't seen in the domain decomposition.  We'll get this fixed!
>> >>
>> >> -Matt
>> >>
>> >> On Sat, Dec 6, 2014 at 7:47 PM, Semyeong Oh <semyeong.oh at gmail.com>
>> >> wrote:
>> >>> Hi Matt,
>> >>>
>> >>> Counting the "Opening MPI Barrier on XX" lines in the log, I think
>> >>> what happens is that somehow, after each projection, an extra barrier
>> >>> is opened. Thus the process that does the projections opens a
>> >>> different number of barriers from the process that is idle and only
>> >>> opens a barrier due to this statement in parallel_objects:
>> >>>    if barrier:
>> >>>        my_communicator.barrier()
>> >>> and this is why it hangs after the second projection.
>> >>>
>> >>> Setting barrier=False on parallel_objects doesn't work either,
>> >>> because then it hangs at the barrier after the first projection.
>> >>> Is this expected, and is there a workaround?
>> >>> I couldn’t pinpoint why this happens exactly, but I’m guessing it has
>> >>> something to do with
>> >>> that yt supports doing the actual projection in parallel, not just
>> >>> using parallel_objects to parallelize over multiple objects.
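A minimal sketch (plain Python, not yt code) of the barrier-count argument above: if each projection a rank performs opens one barrier on top of the single final barrier in parallel_objects, then the busy rank and the idle rank enter different numbers of barriers, and a collective barrier can never complete. Both the per-projection barrier and the round-robin object assignment are assumptions mirroring the hypothesis above, not confirmed yt internals.

```python
# Hypothetical model of the barrier counts described above: assume each
# projection a rank performs opens one barrier, plus the single final
# barrier that parallel_objects itself executes.

def barrier_calls(n_objects, n_ranks, rank):
    """Number of barriers this rank would enter under that assumption."""
    # Assume parallel_objects hands out objects round-robin across ranks.
    my_objects = [i for i in range(n_objects) if i % n_ranks == rank]
    return len(my_objects) + 1  # per-projection barriers + final barrier

# 3 objects on 2 ranks: rank 0 gets objects 0 and 2, rank 1 gets object 1
counts = [barrier_calls(3, 2, r) for r in range(2)]
print(counts)  # [3, 2] -- the counts differ, so a collective barrier hangs
```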
>> >>>
>> >>> Semyeong
>> >>>
>> >>>> On Dec 6, 2014, at 7:39 PM, Semyeong Oh <semyeong.oh at gmail.com>
>> >>>> wrote:
>> >>>>
>> >>>> Hi Matt,
>> >>>>
>> >>>> My output from using 2 processors on 3 objects is here
>> >>>> https://gist.github.com/smoh/6e396a7606a3bbff3450
>> >>>> I’ve set loglevel to 1.
>> >>>> The first two objects ran fine, printing some output:
>> >>>> 500 …
>> >>>> 501 …
>> >>>> Then the third object starts running on rank 0, while rank 1 sleeps.
>> >>>> The projection happens, but after that, upon reaching this line,
>> >>>> P000 yt : [DEBUG    ] 2014-12-06 19:03:48,494 Opening MPI Barrier on 0
>> >>>> the process never returns.
>> >>>> The rest of the output is from sending SIGUSR1 to each process, which
>> >>>> I can hardly interpret.
>> >>>> (There are two projections involved in my calculation.)
>> >>>>
>> >>>> Any clues?
>> >>>>
>> >>>> Thanks,
>> >>>> Semyeong
>> >>>>
>> >>>>
>> >>>>> On Dec 6, 2014, at 11:53 AM, Matthew Turk <matthewturk at gmail.com>
>> >>>>> wrote:
>> >>>>>
>> >>>>> Hi Semyeong,
>> >>>>>
>> >>>>> This is somewhat odd.  When you say the process hangs, do you mean
>> >>>>> that the process of projection hangs, or the yt script as a whole
>> >>>>> hangs?  You should be able to send SIGUSR1 to the processes to get a
>> >>>>> stack trace, which may help with debugging.  Or, if you Ctrl-C, it
>> >>>>> may output a stack trace, which will help you see where it's hanging.
>> >>>>>
>> >>>>> -Matt
>> >>>>>
>> >>>>> On Sat, Dec 6, 2014 at 5:14 AM, Semyeong Oh <semyeong.oh at gmail.com>
>> >>>>> wrote:
>> >>>>>> Hi yt,
>> >>>>>>
>> >>>>>> I have two questions on using parallel_objects. I am using yt 2.6.
>> >>>>>>
>> >>>>>> 1. I have a problem of parallel_objects hanging at the end.
>> >>>>>>
>> >>>>>> def do(i, pf):
>> >>>>>>     cube = pf.h.region(..)
>> >>>>>>     proj = pf.h.proj(..., source=cube)
>> >>>>>>     frb = proj.to_frb(..)
>> >>>>>>     ...
>> >>>>>>
>> >>>>>> objects = [list of indices]
>> >>>>>> pf = load(..)
>> >>>>>> for i in parallel_objects(objects):
>> >>>>>>     do(i, pf)
>> >>>>>>
>> >>>>>> and I run the script as
>> >>>>>> mpirun -np Nprocs python myscript.py --parallel
>> >>>>>>
>> >>>>>> When I tested with a simple print operation in do instead of proj,
>> >>>>>> parallel_objects seems to handle the case where Nobjects is not
>> >>>>>> divisible by Nprocs just fine. But with my real script that has
>> >>>>>> proj in do, it seems to hang at the end. For example, if Nobjects
>> >>>>>> is 3 and Nprocs is 2, the first two objects go through without
>> >>>>>> problem, and the projection of the third completes, but then the
>> >>>>>> process just hangs there. Why so?
>> >>>>>>
>> >>>>>> 2. Is it possible to use only a portion of the Nprocs assigned?
>> >>>>>> Also, playing around with the simple print operation, it seems
>> >>>>>> that because of the way parallel_objects divides work, the work is
>> >>>>>> duplicated. E.g., when I run mpirun -np 5 but have
>> >>>>>> parallel_objects(objects, njobs=3):
>> >>>>>> rank  i_object
>> >>>>>> 0     1
>> >>>>>> 1     1
>> >>>>>> 2     2
>> >>>>>> 3     2
>> >>>>>> 4     3
>> >>>>>> ...
>> >>>>>> so object 1 would still run simultaneously on ranks 0 and 1.
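For what it's worth, the mapping in the table above can be reproduced with a small standalone sketch, under the assumption that parallel_objects splits the ranks into njobs contiguous, nearly equal groups (an assumption for illustration, not verified against the yt source):

```python
# Sketch of the observed rank -> job mapping for mpirun -np 5 with
# parallel_objects(objects, njobs=3), assuming contiguous rank groups.

def job_for_rank(rank, size, njobs):
    # Split `size` ranks into `njobs` contiguous groups, as evenly as possible.
    base, extra = divmod(size, njobs)
    start = 0
    for j in range(njobs):
        length = base + (1 if j < extra else 0)
        if start <= rank < start + length:
            return j + 1  # 1-based job index, matching the table above
        start += length

mapping = {r: job_for_rank(r, 5, 3) for r in range(5)}
print(mapping)  # {0: 1, 1: 1, 2: 2, 3: 2, 4: 3}
```

Each group shares one job, which is why ranks 0 and 1 both see object 1: within a group, the object is handed to every member rank.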
>> >>>>>>
>> >>>>>> To prevent this, would something like below work?
>> >>>>>>
>> >>>>>> size = MPI.COMM_WORLD.Get_size()
>> >>>>>> rank = MPI.COMM_WORLD.Get_rank()
>> >>>>>> njobs = 3
>> >>>>>> for ind in parallel_objects(objects, njobs):
>> >>>>>>  if rank % int(size/njobs) != 0:
>> >>>>>>      continue
>> >>>>>>  else:
>> >>>>>>      do(ind)
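One caveat with that filter, checked with plain arithmetic: for size=5 and njobs=3, int(size/njobs) is 1, and rank % 1 == 0 for every rank, so the condition never skips anyone:

```python
# Quick check of the proposed filter: with 5 ranks and 3 jobs,
# int(size / njobs) == 1, and every rank satisfies rank % 1 == 0,
# so no work is actually skipped.
size, njobs = 5, 3
step = int(size / njobs)                      # == 1
active = [r for r in range(size) if r % step == 0]
print(active)  # [0, 1, 2, 3, 4] -- the filter excludes nobody
```

Even with a corrected modulus, skipping ranks inside the loop may interfere with any collective operations parallel_objects performs within a group, so this kind of filter should be treated with caution.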
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> Semyeong
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> _______________________________________________
>> >>>>>> yt-users mailing list
>> >>>>>> yt-users at lists.spacepope.org
>> >>>>>> http://lists.spacepope.org/listinfo.cgi/yt-users-spacepope.org
>> >>>>
>> >>>
>>
>
>
>



