[Yt-dev] RMS mass overdensity method

Matthew Turk matthewturk at gmail.com
Wed Mar 4 08:33:54 PST 2009


> Yes, in fact, I can see that after the first MPI error, other threads can still go and try to get spheres before they get killed by task manager.

Okay, hm, interesting.

>
> There are no core dumps. I ssh-ed in and ran 'top' and looked at the memory per process and total usage on the node, and it wasn't approaching the limits of the machine when it crashed. I think the fact that it crashes at different places in the run cycle means something else, but I don't know what.

This is suspicious.  In fact, it makes me think there *could* be a
problem with processes hanging while they wait at barriers.  Do you
have debug logging turned on?  That should log a message whenever a
barrier is entered, provided the barrier goes through the standard
barrierization.  (That's one of the reasons I try to avoid any raw
MPI calls.)
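
To illustrate the idea of logging around a barrier, here is a minimal
sketch.  It is not yt's actual code: the names `logged_barrier` and
`barrier_demo` are hypothetical, `rank` is passed in by the caller, and
`comm` is assumed to be any object with a `Barrier()` method (an mpi4py
communicator has one).

```python
import logging

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("barrier_demo")


def logged_barrier(comm, rank, label=""):
    """Log before and after a barrier so a hung process is visible in the logs.

    Assumptions: `comm` exposes a Barrier() method and `rank` is this
    process's rank.  `label` is a free-form tag naming the call site.
    """
    logger.debug("rank %d entering barrier %s", rank, label)
    comm.Barrier()  # blocks until every process in comm has reached it
    logger.debug("rank %d left barrier %s", rank, label)
```

If a process logs "entering" but never "left", it is stuck waiting for
some other process that never reached the same barrier.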

I'll see if I can write up a long-overdue mechanism for distinguishing
logs by processor and paste that.
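
A rough sketch of what tagging log lines by processor might look like,
using only Python's standard `logging` module.  The function name
`make_rank_logger`, the logger name `yt_demo`, and the `P0003`-style
prefix are all assumptions made for illustration, not the eventual yt
mechanism.

```python
import logging


def make_rank_logger(rank, name="yt_demo", stream=None):
    """Return a logger whose every line is prefixed with the processor rank.

    `name` and the prefix layout are hypothetical choices for this sketch.
    """
    logger = logging.getLogger("%s.p%04d" % (name, rank))
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler(stream)  # stream=None means stderr
        # Bake the rank into the format string, e.g. "P0003 DEBUG message".
        handler.setFormatter(
            logging.Formatter("P%04d %%(levelname)s %%(message)s" % rank)
        )
        logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep the root logger from printing it again
    return logger
```

Under mpi4py the rank would come from `MPI.COMM_WORLD.Get_rank()`; with
the prefix in place, interleaved output from several processes can be
separated with a simple grep on the tag.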

-Matt


