On Fri, Jan 18, 2008 at 08:33:10PM -0500, Jeff Squyres wrote:
> Barry --
>
> Could you check what apps are still running when it hangs?  I.e., I
> assume that all the uptime's are dead; are all the orted's dead on the
> remote nodes?  (orted = our helper process that is launched on the
> remote nodes to exert process control, funnel I/O back and forth to
> mpirun, etc.)
Here's the stack trace of the orted process on node 01.  The "uname"
process was long gone (and had sent its output back with no
difficulty).

============

Stopping process localhost:5321 (/osr/users/rountree/ompi-1.2.4_intel_threaded_debug/bin/orted).
Thread received signal INT
stopped at [<opaque> pthread_cond_wait@@GLIBC_2.3.2(...) 0x00002aaaab67a766]
(idb) where
>0  0x00002aaaab67a766 in pthread_cond_wait@@GLIBC_2.3.2(...) in /lib64/libpthread-2.4.so
#1  0x0000000000401fef in opal_condition_wait(c=0x5075c0, m=0x507580) "../../../opal/threads/condition.h":64
#2  0x0000000000403000 in main(argc=17, argv=0x7ffffd82cd38) "orted.c":525
#3  0x00002aaaab7a6e54 in __libc_start_main(...) in /lib64/libc-2.4.so
#4  0x0000000000401c19 in _start(...) in /osr/users/rountree/ompi-1.2.4_intel_threaded_debug/bin/orted

============

The mpirun process on the root node isn't quite as useful.

============

Stopping process localhost:29856 (/osr/users/rountree/ompi-1.2.4_intel_threaded_debug/bin/orterun).
Thread received signal INT
stopped at [<opaque> poll(...) 0x00000039ef2c3806]
(idb) where
>0  0x00000039ef2c3806 in poll(...) in /lib64/libc-2.4.so
#1  0x0000000040a000c0

============

Let me know what other information would be helpful.

Best,

Barry

>
> If any of the orted's are still running, can you connect to them with
> gdb and get a backtrace to see where they are hung?
>
> Likewise, can you connect to mpirun with gdb and get a backtrace of
> where it's hung?
>
> Ralph, the main ORTE developer, is pretty sure that it's stuck in the
> IO flushing routines that are executed at the end of the run (look for
> function names like iof_flush or similar).  We thought we had fixed
> all of those on the 1.2 branch, but perhaps there's some other weird
> race condition happening that doesn't happen on our test machines...
>
>
> On Jan 13, 2008, at 10:17 AM, Barry Rountree wrote:
>
> > On Sun, Jan 13, 2008 at 09:54:47AM -0500, Barry Rountree wrote:
> > > Hello,
> > >
> > > The following command
> > >
> > > mpirun -np 2 -hostfile ~/hostfile uptime
> > >
> > > will occasionally hang after completing.  The expected output
> > > appears on the screen, but mpirun needs a SIGKILL to return to
> > > the console.
> > >
> > > This has been verified with OpenMPI v1.2.4 compiled with both
> > > icc 9.1 20061101 (aka 9.1.045) and gcc 4.1.0 20060304 (aka
> > > Red Hat 4.1.0-3).  I have also tried earlier versions of OpenMPI
> > > and found the same bug (1.1.2 and 1.2.2).
> > >
> > > Using -verbose didn't provide any additional output.  I'm happy
> > > to help track down whatever is causing this.
> >
> > A couple more data points:
> >
> > mpirun -np 4 -hostfile ~/hostfile --no-daemonize uptime
> >
> > hung twice over 100 runs.  Without the --no-daemonize flag, the
> > command hung 16 times over 100 runs.  (This is using the version
> > compiled with icc.)
> >
> > Barry
> >
> > > Many thanks,
> > >
> > > Barry Rountree
> > > Ph.D. Candidate, Computer Science
> > > University of Georgia
>
> --
> Jeff Squyres
> Cisco Systems
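
As an aside on what the orted backtrace above likely means: frame #1 shows the
daemon parked in opal_condition_wait() (condition.h:64), i.e. blocked on a
condition variable waiting to be told the job is finished.  The sketch below is
not Open MPI source; the names (job_complete, completion_callback, etc.) are
invented for illustration.  It shows the same pattern: main() waits on a
condition variable, and a callback, which in the real code would fire once all
I/O has been flushed back to mpirun, signals it.  If that signal is lost to a
race, pthread_cond_wait() never returns and the daemon hangs with a stack very
much like the one shown above.

============

/* Minimal sketch of the wait pattern suggested by the orted backtrace.
 * NOT Open MPI source; all names here are made up for illustration.
 * main() blocks on a condition variable until a callback marks the job
 * complete; if that callback is never invoked (e.g. a lost I/O-flush
 * notification), pthread_cond_wait() never returns. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t daemon_lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  daemon_condition = PTHREAD_COND_INITIALIZER;
static int job_complete = 0;            /* the "safe to exit" flag */

/* Stand-in for the notification that all output has been forwarded.
 * If this thread never runs (or never signals), main() waits forever. */
static void *completion_callback(void *arg)
{
    (void)arg;
    sleep(1);                           /* stand-in for real work */
    pthread_mutex_lock(&daemon_lock);
    job_complete = 1;
    pthread_cond_signal(&daemon_condition);
    pthread_mutex_unlock(&daemon_lock);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, completion_callback, NULL);

    pthread_mutex_lock(&daemon_lock);
    while (!job_complete) {
        /* Analogue of the wait at condition.h:64: park the thread
         * until someone signals that the job has completed. */
        pthread_cond_wait(&daemon_condition, &daemon_lock);
    }
    pthread_mutex_unlock(&daemon_lock);

    pthread_join(t, NULL);
    printf("daemon exiting cleanly\n");
    return 0;
}

============

Build with "cc -pthread" and it exits cleanly after about a second; comment out
the pthread_cond_signal() call and it sits in pthread_cond_wait() indefinitely,
which is the kind of behavior the hung orted is exhibiting.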