On Thu, Jan 24, 2008 at 03:01:40AM -0500, Barry Rountree wrote:
> On Fri, Jan 18, 2008 at 08:33:10PM -0500, Jeff Squyres wrote:
> > Barry --
> >
> > Could you check what apps are still running when it hangs? I.e., I
> > assume that all the uptime's are dead; are all the orted's dead on the
> > remote nodes? (orted = our helper process that is launched on the
> > remote nodes to exert process control, funnel I/O back and forth to
> > mpirun, etc.)
One more bit of trivia -- when I ran my killall script across the nodes, four of the sixteen still had an orted process hanging around. If this is a synchronization problem, then most of the nodes are handling it fine.

> Here's the stack trace of the orted process on node 01. The "uname"
> process was long gone (and had sent its output back with no difficulty).
>
> ============
> Stopping process localhost:5321
> (/osr/users/rountree/ompi-1.2.4_intel_threaded_debug/bin/orted).
> Thread received signal INT
> stopped at [<opaque> pthread_cond_wait@@GLIBC_2.3.2(...) 0x00002aaaab67a766]
> (idb) where
> >0  0x00002aaaab67a766 in pthread_cond_wait@@GLIBC_2.3.2(...) in /lib64/libpthread-2.4.so
> #1  0x0000000000401fef in opal_condition_wait(c=0x5075c0, m=0x507580) "../../../opal/threads/condition.h":64
> #2  0x0000000000403000 in main(argc=17, argv=0x7ffffd82cd38) "orted.c":525
> #3  0x00002aaaab7a6e54 in __libc_start_main(...) in /lib64/libc-2.4.so
> #4  0x0000000000401c19 in _start(...) in /osr/users/rountree/ompi-1.2.4_intel_threaded_debug/bin/orted
> ============
>
> The stack trace of the mpirun process on the root node isn't quite as useful.
>
> ============
> Stopping process localhost:29856
> (/osr/users/rountree/ompi-1.2.4_intel_threaded_debug/bin/orterun).
> Thread received signal INT
> stopped at [<opaque> poll(...) 0x00000039ef2c3806]
> (idb) where
> >0  0x00000039ef2c3806 in poll(...) in /lib64/libc-2.4.so
> #1  0x0000000040a000c0
> ============
>
> Let me know what other information would be helpful.
>
> Best,
>
> Barry
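For reference, a minimal sketch of the kind of per-node check a "killall script" like the one mentioned at the top might perform. This is a hypothetical reconstruction, not Barry's actual script; the node01..node16 hostnames, passwordless ssh, and the use of pgrep are all assumptions.

============
/* Hypothetical check for leftover orted processes across 16 nodes.
 * Assumes passwordless ssh and hosts named node01..node16. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char cmd[128];
    for (int i = 1; i <= 16; i++) {
        /* pgrep -x exits 0 only if a process named exactly "orted" exists */
        snprintf(cmd, sizeof cmd, "ssh node%02d pgrep -x orted >/dev/null", i);
        if (system(cmd) == 0)
            printf("node%02d: orted still running\n", i);
    }
    return 0;
}
============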
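The orted trace itself is informative: main() is parked in opal_condition_wait(), which (as frame 0 shows) comes down to pthread_cond_wait() in this threaded build, so the daemon is still waiting to be told the job is finished. A minimal sketch of that wait pattern follows; the names are illustrative, not the actual Open MPI source.

============
/* Sketch of the condition-wait pattern the orted trace shows
 * (illustrative names -- not the actual Open MPI code). */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool job_complete = false;   /* hypothetical completion flag */

/* Called (e.g., from the event loop) when the job is known to be done. */
void signal_job_complete(void)
{
    pthread_mutex_lock(&lock);
    job_complete = true;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

/* Roughly what main() is doing at orted.c:525 in the trace above. */
void wait_for_job_complete(void)
{
    pthread_mutex_lock(&lock);
    while (!job_complete)           /* loop guards against spurious wakeups */
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
}
============

If the wake-up never arrives -- say, because a completion message from mpirun or another daemon is lost -- this wait blocks forever, which would be consistent with the four leftover orteds described above.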