[yt-dev] Rockstar on multiple nodes

Matthew Turk matthewturk at gmail.com
Tue Nov 27 13:21:11 PST 2012


Hi Stephen,

To answer your last question first: I think there is essentially zero
chance that the problem lies in Rockstar itself; it is almost
certainly a result of the library wrapping.  Comments below.

On Tue, Nov 27, 2012 at 3:50 PM, Stephen Skory <s at skory.us> wrote:
> Hi Peter and yt developers,
>
> (Peter - I am copying you on this message that is also going to the yt
> developers email list. I believe if you reply-all the yt-dev copy of
> the message will bounce back to you, or be held for moderation, if
> you're not subscribed to yt-dev.)
>
> I am having trouble running Rockstar within yt on more than one node.
> For example, I am successful in running with 12 tasks total on one
> node, but 6 tasks each on two nodes does not work. By not working, I
> mean that it prints a few of these messages (one per PID, though not
> for every PID; the count appears to equal NUM_WRITERS)
>
> "[Warning] Network IO Failure (PID NNNNN):"
>
> with various explanations ("Connection reset by peer", "Address
> already in use", "Broken pipe"). After that it prints

This sounds to me like a problem with the forking model that's in
place, but I might be wrong.  The way I originally set it up, yt
guesses at the addresses and hosts, which get fed in after the call to
_get_hosts.  My recommendation would be to print out all of these
values and then look at the DEBUG_RSOCKET output.
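
As a rough sketch of the kind of printout I mean, something like the
following could be run by each task.  It uses only the Python standard
library; describe_endpoint and format_endpoint are hypothetical helper
names, not part of yt or Rockstar, and the address it reports is just
what the local resolver returns, which may differ from what yt guesses.

```python
import os
import socket


def describe_endpoint():
    """Collect the host name, resolved address, and PID of this process."""
    hostname = socket.gethostname()
    try:
        # Ask the local resolver which address this host name maps to.
        address = socket.gethostbyname(hostname)
    except OSError:
        # The host name may not resolve on every cluster configuration.
        address = "unresolved"
    return {"hostname": hostname, "address": address, "pid": os.getpid()}


def format_endpoint(info):
    """Render the collected values as a single log line."""
    return "host=%(hostname)s addr=%(address)s pid=%(pid)d" % info


if __name__ == "__main__":
    print(format_endpoint(describe_endpoint()))
```

Comparing these lines across the two nodes against the values fed in
after _get_hosts should show whether the guessed hosts and addresses
match what each task actually sees.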

>
> "[Network] Packet send retry count at: 1"
>
> And then it hangs.
>
> I have turned on DEBUG_RSOCKET and I can see that tasks on both nodes
> are communicating with the server, and I also see "Accepted all reader
> / writer connections." and "Verified all reader / writer connections."
> The process gets as far as "Analyzing for halos / subhalos..." but it
> does not make it to " Constructing merger tree..." I am running on a
> single snapshot.

Is it possible that a process has died?
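
One way to check, assuming a POSIX system and that you run it on the
node where the task was started, is to probe the PIDs from the
"Network IO Failure" messages with signal 0.  pid_is_alive is a
hypothetical helper name; this is only a sketch, not something yt
provides.

```python
import errno
import os


def pid_is_alive(pid):
    """Return True if a process with this PID exists on the local node."""
    try:
        # Signal 0 performs the existence and permission checks only;
        # no signal is actually delivered to the process.
        os.kill(pid, 0)
    except OSError as err:
        # EPERM means the process exists but belongs to another user;
        # any other error (typically ESRCH) means there is no such process.
        return err.errno == errno.EPERM
    return True


if __name__ == "__main__":
    # Probing our own PID should always report a live process.
    print(pid_is_alive(os.getpid()))
```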

>
> I have tracked down that the call of accept() in _accept_connection()
> (in socket.c) is where the hang is happening. It looks like that has
> been called by repair_connection() (in rsocket.c). If I'm interpreting
> things correctly (please correct me if I'm not), the other half of
> repair_connection() is a call to _reconnect_to_addr() done by a
> different task. It looks to me like no other task is making that
> matching call, and that's where the hang is happening.
>
> I have done a test with stand-alone Rockstar on the same machine, and
> I am successful running it on 2 nodes. I think this means that there
> is some weirdness with the communication when running Rockstar as a
> library in yt/Python, and not the machine's network.
>
> I'm wondering if anyone has been successful running Rockstar in yt on
> more than one node? Also, does anyone have any intuition for what
> might be going wrong here?
>
> Thanks!
>
> --
> Stephen Skory
> s at skory.us
> http://stephenskory.com/
> 510.621.3687 (google voice)
> _______________________________________________
> yt-dev mailing list
> yt-dev at lists.spacepope.org
> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org


