Barry --

Could you check which processes are still running when it hangs? I assume all of the uptime processes have exited; have all of the orteds on the remote nodes exited as well? (orted is our helper process that is launched on the remote nodes to exert process control, funnel I/O back and forth to mpirun, etc.)

If any of the orted's are still running, can you connect to them with gdb and get a backtrace to see where they are hung?

Likewise, can you connect to mpirun with gdb and get a backtrace of where it's hung?
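
In case it helps, here is a rough sketch of one way to collect those backtraces non-interactively. The function name dump_backtraces is made up for this example; it assumes pgrep and gdb are installed on the node and that you are allowed to attach to the processes (same user, ptrace not restricted).

```shell
#!/bin/sh
# dump_backtraces NAME: attach gdb to every process whose name is exactly
# NAME and print a backtrace of all its threads.
# Hypothetical helper for illustration; not part of Open MPI.
dump_backtraces() {
    name=$1
    # pgrep -x matches the process name exactly and prints one PID per line
    for pid in $(pgrep -x "$name"); do
        echo "=== $name pid $pid ==="
        # -batch makes gdb exit after running the -ex command,
        # which also detaches from the process
        gdb -p "$pid" -batch -ex "thread apply all bt" 2>&1
    done
}
```

Running `dump_backtraces orted` on each remote node and `dump_backtraces mpirun` on the node where mpirun was started should print one backtrace per hung process; if a process has already exited, nothing is printed for it.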

Ralph, the main ORTE developer, is pretty sure that it's stuck in the IO flushing routines that are executed at the end of time (look for function names like iof_flush or similar). We thought we had fixed all of those on the 1.2 branch, but perhaps there's some other weird race condition happening that doesn't happen on our test machines...



On Jan 13, 2008, at 10:17 AM, Barry Rountree wrote:

On Sun, Jan 13, 2008 at 09:54:47AM -0500, Barry Rountree wrote:
> Hello,
>
> The following command
>
> mpirun -np 2 -hostfile ~/hostfile uptime
>
> will occasionally hang after completing. The expected output appears on
> the screen, but mpirun needs a SIGKILL to return to the console.
>
> This has been verified with OpenMPI v1.2.4 compiled with both icc 9.1
> 20061101 (aka 9.1.045) and gcc 4.1.0 20060304 (aka Red Hat 4.1.0-3). I
> have also tried earlier versions of OpenMPI and found the same bug
> (1.1.2 and 1.2.2).
>
> Using -verbose didn't provide any additional output. I'm happy to help
> track down whatever is causing this.

A couple more data points:

mpirun -np 4 -hostfile ~/hostfile --no-daemonize uptime

hung twice over 100 runs. Without the --no-daemonize, the command hung 16 times over 100 runs. (This is using the version compiled with icc.)
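
A hang count like the one above can be gathered with a small loop. This is only a sketch: the function name count_hangs and the 60-second limit in the usage line are assumptions for illustration, and it relies on the coreutils timeout command. SIGKILL is used because mpirun did not return without it.

```shell
#!/bin/sh
# count_hangs RUNS LIMIT CMD...: run CMD RUNS times and print how many
# runs had to be killed because they were still alive after LIMIT seconds.
# Hypothetical helper for illustration.
count_hangs() {
    runs=$1
    limit=$2
    shift 2
    hangs=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Send SIGKILL once the limit expires; discard the command's output
        timeout -s KILL "$limit" "$@" >/dev/null 2>&1
        status=$?
        # 124 = timed out; 137 = 128+9, reported when SIGKILL was sent
        case $status in
            124|137) hangs=$((hangs + 1)) ;;
        esac
        i=$((i + 1))
    done
    echo "$hangs"
}
```

For example, `count_hangs 100 60 mpirun -np 4 -hostfile ~/hostfile uptime` prints the number of runs that were killed. Killing mpirun this way may leave orted processes behind on the remote nodes, so those may need to be cleaned up between experiments.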

Barry

>
> Many thanks,
>
> Barry Rountree
> Ph.D. Candidate, Computer Science
> University of Georgia
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jeff Squyres
Cisco Systems
