On Fri, Jan 18, 2008 at 08:33:10PM -0500, Jeff Squyres wrote:
> Barry --
>
> Could you check what apps are still running when it hangs? I.e., I
> assume that all the uptime's are dead; are all the orted's dead on the
> remote nodes? (orted = our helper process that is launched on the
> remote nodes to exert process control, funnel I/O back and forth to
> mpirun, etc.)
>
> If any of the orted's are still running, can you connect to them with
> gdb and get a backtrace to see where they are hung?
>
> Likewise, can you connect to mpirun with gdb and get a backtrace of
> where it's hung?
>
> Ralph, the main ORTE developer, is pretty sure that it's stuck in the
> IO flushing routines that are executed at the end of time (look for
> function names like iof_flush or similar). We thought we had fixed
> all of those on the 1.2 branch, but perhaps there's some other weird
> race condition happening that doesn't happen on our test machines...

I'm happy to help. I've got a paper submission deadline on Tuesday, so
it might not be until midweek.

Thanks for the reply,

Barry

> On Jan 13, 2008, at 10:17 AM, Barry Rountree wrote:
>
> > On Sun, Jan 13, 2008 at 09:54:47AM -0500, Barry Rountree wrote:
> > > Hello,
> > >
> > > The following command
> > >
> > > mpirun -np 2 -hostfile ~/hostfile uptime
> > >
> > > will occasionally hang after completing. The expected output appears on
> > > the screen, but mpirun needs a SIGKILL to return to the console.
> > >
> > > This has been verified with OpenMPI v1.2.4 compiled with both icc 9.1
> > > 20061101 (aka 9.1.045) and gcc 4.1.0 20060304 (aka Red Hat 4.1.0-3). I
> > > have also tried earlier versions of OpenMPI and found the same bug
> > > (1.1.2 and 1.2.2).
> > >
> > > Using -verbose didn't provide any additional output. I'm happy to help
> > > tracking down whatever is causing this.
> >
> > A couple more data points:
> >
> > mpirun -np 4 -hostfile ~/hostfile --no-daemonize uptime
> >
> > hung twice over 100 runs. Without the --no-daemonize, the command hung
> > 16 times over 100 runs. (This is using the version compiled with icc.)
> >
> > Barry
> >
> > > Many thanks,
> > >
> > > Barry Rountree
> > > Ph.D. Candidate, Computer Science
> > > University of Georgia
> > >
> > > _______________________________________________
> > > users mailing list
> > > us...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> Jeff Squyres
> Cisco Systems
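
For anyone who wants to reproduce the "hung N times over 100 runs" counts quoted above, here is a rough sketch of how such a tally might be automated. It is not the script used for the numbers in this thread; the function name `count_hangs` and the 30-second timeout are assumptions made for illustration, and a run is simply treated as "hung" if the command has not exited by the timeout.

```python
import subprocess


def count_hangs(cmd, runs=100, timeout_s=30.0):
    """Run `cmd` repeatedly and count the runs that do not exit in time.

    `cmd` is an argument list, e.g.
    ["mpirun", "-np", "4", "-hostfile", "/path/to/hostfile", "uptime"].
    The timeout is an arbitrary choice (assumption): pick something much
    longer than a normal run takes on your cluster.
    """
    hangs = 0
    for _ in range(runs):
        try:
            # Output is discarded; only whether the process exits matters.
            subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run() kills the child before raising, so the
            # stuck process does not linger locally between iterations.
            hangs += 1
    return hangs


# Hypothetical usage (not executed here):
#   hung = count_hangs(["mpirun", "-np", "4", "-hostfile",
#                       "/path/to/hostfile", "uptime"])
#   print(f"hung {hung} times over 100 runs")
```

Because the timed-out process is killed automatically, this loop is only good for measuring how often the hang occurs. To get the gdb backtraces requested above, the hung mpirun and any remaining orted processes would have to be left running instead. Killing mpirun locally also does nothing about helper processes that may still be alive on the remote nodes.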