Hi Peter and yt developers,

I am having trouble running Rockstar within yt on more than one node.
For example, a run with 12 tasks on one node succeeds, but a run with
6 tasks on each of two nodes does not. By "does not work" I mean that
it prints a few of these messages (one per PID, though not from every
PID; the count appears to equal NUM_WRITERS):

"[Warning] Network IO Failure (PID NNNNN):"

with various explanations ("Connection reset by peer", "Address
already in use", "Broken pipe"). After that it prints

"[Network] Packet send retry count at: 1"

And then it hangs.

I have turned on DEBUG_RSOCKET and I can see that tasks on both nodes
are communicating with the server, and I also see "Accepted all reader
/ writer connections." and "Verified all reader / writer connections."
The process gets as far as "Analyzing for halos / subhalos..." but
never reaches "Constructing merger tree...". I am running on a
single snapshot.

I have tracked the hang down to the accept() call in
_accept_connection() (in socket.c). It looks like that was called by
repair_connection() (in rsocket.c). If I'm interpreting things
correctly (please correct me if I'm not), the other half of
repair_connection() is a call to _reconnect_to_addr() made by a
different task. It looks to me like no other task ever makes that
matching call, so the accept() waits forever.
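
To illustrate the failure mode I suspect, here is a minimal sketch in
plain Python sockets. It is not Rockstar code: accept_with_timeout(),
make_listener() and reconnect() are made-up names, and the timeout is
there only so the demo finishes. As far as I can tell the accept() in
_accept_connection() has no timeout, so in the unmatched case it would
block indefinitely instead of returning.

```python
import socket
import threading


def make_listener():
    """Open a listening TCP socket on an OS-chosen localhost port."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    return server


def accept_with_timeout(server, timeout):
    """Return True if a peer connected before the timeout, else False."""
    server.settimeout(timeout)
    try:
        conn, _ = server.accept()
    except socket.timeout:
        # Nobody connected. Without a timeout this call would never return.
        return False
    conn.close()
    return True


# Case 1: no other task reconnects, so accept() has nothing to match.
server = make_listener()
unmatched = accept_with_timeout(server, 0.5)
server.close()

# Case 2: a second thread plays the task that reconnects.
server = make_listener()
port = server.getsockname()[1]


def reconnect():
    peer_sock = socket.create_connection(("127.0.0.1", port))
    peer_sock.close()


peer = threading.Thread(target=reconnect)
peer.start()
matched = accept_with_timeout(server, 5.0)
peer.join()
server.close()

print("unmatched accept returned a connection:", unmatched)
print("matched accept returned a connection:", matched)
```

If this picture is right, the question becomes why the task that
should reconnect never does so when Rockstar runs inside yt.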

I have also tested stand-alone Rockstar on the same machine, and it
runs successfully on 2 nodes. I think this means the problem lies in
the communication when Rockstar runs as a library inside yt/Python,
rather than in the machine's network.

Has anyone successfully run Rockstar in yt on more than one node?
Also, does anyone have any intuition for what might be going wrong
here?


