On Fri, Jan 18, 2008 at 08:33:10PM -0500, Jeff Squyres wrote:
> Barry --
> 
> Could you check what apps are still running when it hangs?  I.e., I
> assume that all the uptimes are dead; are all the orteds dead on the
> remote nodes?  (orted = our helper process that is launched on the
> remote nodes to exert process control, funnel I/O back and forth to
> mpirun, etc.)

Here's the stack trace of the orted process on node 01.  The "uname" 
process was long gone (and had sent its output back with no difficulty).

============
Stopping process localhost:5321 
(/osr/users/rountree/ompi-1.2.4_intel_threaded_debug/bin/orted).
Thread received signal INT
stopped at [<opaque> pthread_cond_wait@@GLIBC_2.3.2(...) 0x00002aaaab67a766]
(idb) where
>0  0x00002aaaab67a766 in pthread_cond_wait@@GLIBC_2.3.2(...) in
/lib64/libpthread-2.4.so
#1  0x0000000000401fef in opal_condition_wait(c=0x5075c0, m=0x507580) 
"../../../opal/threads/condition.h":64
#2  0x0000000000403000 in main(argc=17, argv=0x7ffffd82cd38) "orted.c":525
#3  0x00002aaaab7a6e54 in __libc_start_main(...) in /lib64/libc-2.4.so
#4  0x0000000000401c19 in _start(...) in 
/osr/users/rountree/ompi-1.2.4_intel_threaded_debug/bin/orted
============
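
Frames #1 and #2 suggest orted's main() is parked in the usual
condition-variable wait, blocking until something signals it to shut
down.  Just to illustrate the shape of that pattern (this is a minimal
sketch, not the actual Open MPI source; shutdown_lock, shutdown_cond,
and exit_requested are placeholder names):

============
/* Minimal sketch of a condition-variable shutdown wait -- NOT the
 * actual Open MPI code.  Compile with: cc -pthread sketch.c */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t shutdown_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  shutdown_cond = PTHREAD_COND_INITIALIZER;
static bool exit_requested = false;

/* Some other thread (or event callback) is supposed to run this once
 * the job is finished; if it never runs, main() blocks forever. */
static void *signal_exit(void *arg)
{
    (void)arg;
    sleep(1);
    pthread_mutex_lock(&shutdown_lock);
    exit_requested = true;
    pthread_cond_signal(&shutdown_cond);
    pthread_mutex_unlock(&shutdown_lock);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, signal_exit, NULL);

    pthread_mutex_lock(&shutdown_lock);
    while (!exit_requested) {
        /* This is where the backtrace above parks: pthread_cond_wait
         * releases the mutex and sleeps until the condition is
         * signaled.  A hang here means the "exit" notification never
         * arrived. */
        pthread_cond_wait(&shutdown_cond, &shutdown_lock);
    }
    pthread_mutex_unlock(&shutdown_lock);

    pthread_join(t, NULL);
    printf("exited cleanly\n");
    return 0;
}
============

So the daemon itself looks healthy; the trace would be consistent with
it simply never being told that it's allowed to exit.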

The stack trace of the mpirun process on the root node isn't quite as useful.


============
Stopping process localhost:29856 
(/osr/users/rountree/ompi-1.2.4_intel_threaded_debug/bin/orterun).
Thread received signal INT
stopped at [<opaque> poll(...) 0x00000039ef2c3806]
(idb) where
>0  0x00000039ef2c3806 in poll(...) in /lib64/libc-2.4.so
#1  0x0000000040a000c0
============
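
Frame #0 being a bare poll() with only an unresolved caller is what
you'd expect if mpirun is just sitting in its event/progress loop,
waiting on a descriptor for a message (output flushed, daemon gone)
that never shows up.  Here's a rough sketch of that kind of blocking
loop; none of these names come from the Open MPI code, it's only to
show the shape:

============
/* Rough illustration of a blocking poll() event loop -- not mpirun's
 * actual code.  With timeout = -1, poll() never returns on its own;
 * if the peer never writes and never closes the descriptor, the
 * process sits in poll() exactly like the trace above. */
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

static void event_loop(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    for (;;) {
        int rc = poll(&pfd, 1, -1);      /* block until fd is ready */
        if (rc < 0) {
            perror("poll");
            return;
        }
        if (pfd.revents & (POLLIN | POLLHUP)) {
            char buf[64];
            ssize_t n = read(fd, buf, sizeof(buf));
            if (n <= 0)
                return;                  /* peer closed: clean exit */
            /* ...dispatch the message and decide whether to exit... */
            return;
        }
    }
}

int main(void)
{
    int fds[2];
    if (pipe(fds) != 0) {
        perror("pipe");
        return 1;
    }
    write(fds[1], "done", 4);   /* comment this out and it hangs */
    event_loop(fds[0]);
    printf("event loop returned\n");
    return 0;
}
============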

Let me know what other information would be helpful.  

Best,

Barry


> 
> If any of the orteds are still running, can you connect to them with  
> gdb and get a backtrace to see where they are hung?
> 
> Likewise, can you connect to mpirun with gdb and get a backtrace of  
> where it's hung?
> 
> Ralph, the main ORTE developer, is pretty sure that it's stuck in the  
> IO flushing routines that are executed at the end of the run (look for  
> function names like iof_flush or similar).  We thought we had fixed  
> all of those on the 1.2 branch, but perhaps there's some other weird  
> race condition happening that doesn't happen on our test machines...
> 
> 
> 
> On Jan 13, 2008, at 10:17 AM, Barry Rountree wrote:
> 
> > On Sun, Jan 13, 2008 at 09:54:47AM -0500, Barry Rountree wrote:
> > > Hello,
> > >
> > > The following command
> > >
> > > mpirun -np 2 -hostfile ~/hostfile uptime
> > >
> > > will occasionally hang after completing.  The expected output
> > > appears on the screen, but mpirun needs a SIGKILL to return to the
> > > console.
> > >
> > > This has been verified with OpenMPI v1.2.4 compiled with both icc
> > > 9.1 20061101 (aka 9.1.045) and gcc 4.1.0 20060304 (aka Red Hat
> > > 4.1.0-3).  I have also tried earlier versions of OpenMPI and found
> > > the same bug (1.1.2 and 1.2.2).
> > >
> > > Using -verbose didn't provide any additional output.  I'm happy to
> > > help track down whatever is causing this.
> >
> > A couple more data points:
> >
> > mpirun -np 4 -hostfile ~/hostfile --no-daemonize uptime
> >
> > hung twice over 100 runs.  Without the --no-daemonize, the command
> > hung 16 times over 100 runs.  (This is using the version compiled
> > with icc.)
> >
> > Barry
> >
> > >
> > > Many thanks,
> > >
> > > Barry Rountree
> > > Ph.D. Candidate, Computer Science
> > > University of Georgia
> > >
> 
> 
> -- 
> Jeff Squyres
> Cisco Systems
> 
