Hello Barry,
I am guessing you are trying to use a threaded build of Open MPI...

Unfortunately, the threading support in Open MPI 1.2.x is not only poorly
tested, it has many known problems.  We advise against using threading
anywhere in the 1.2.x series.  As of version 1.2.5 we even warn about it:
MPI_INIT now emits run-time warnings when MPI_THREAD_MULTIPLE and/or
progression threads are used.

We are targeting the 1.3 series as the first release with working threading support.
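
For anyone following along, that threading level is something the application
requests explicitly at startup.  Here is a minimal sketch using only standard
MPI calls (nothing Open MPI specific, and the warning text is made up for
illustration); checking `provided` is the portable way to see what the
library actually granted:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int provided;

        /* Ask for full multi-threading; the library may grant less. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        if (provided < MPI_THREAD_MULTIPLE) {
            fprintf(stderr, "warning: requested MPI_THREAD_MULTIPLE, "
                    "got thread level %d\n", provided);
        }

        /* ... application code ... */

        MPI_Finalize();
        return 0;
    }

On a 1.2.x build that was not configured with thread support, you should
expect `provided` to come back lower than what was requested.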

On Jan 24, 2008 3:25 AM, Barry Rountree <rount...@cs.uga.edu> wrote:
> On Thu, Jan 24, 2008 at 03:01:40AM -0500, Barry Rountree wrote:
> > On Fri, Jan 18, 2008 at 08:33:10PM -0500, Jeff Squyres wrote:
> > > Barry --
> > >
> > > Could you check what apps are still running when it hangs?  I.e., I
> > > assume that all the uptimes are dead; are all the orteds dead on the
> > > remote nodes?  (orted = our helper process that is launched on the
> > > remote nodes to exert process control, funnel I/O back and forth to
> > > mpirun, etc.)
>
> One more bit of trivia -- when I ran my killall script across the nodes,
> four of the sixteen still had an orted process hanging around.  If this
> is a synchronization problem, then most of the nodes are handling it
> fine.
>
> >
> > Here's the stack trace of the orted process on node 01.  The "uname"
> > process was long gone (and had sent its output back with no difficulty).
> >
> > ============
> > Stopping process localhost:5321 (/osr/users/rountree/ompi-1.2.4_intel_threaded_debug/bin/orted).
> > Thread received signal INT
> > stopped at [<opaque> pthread_cond_wait@@GLIBC_2.3.2(...) 0x00002aaaab67a766]
> > (idb) where
> > >0  0x00002aaaab67a766 in pthread_cond_wait@@GLIBC_2.3.2(...) in /lib64/libpthread-2.4.so
> > #1  0x0000000000401fef in opal_condition_wait(c=0x5075c0, m=0x507580) "../../../opal/threads/condition.h":64
> > #2  0x0000000000403000 in main(argc=17, argv=0x7ffffd82cd38) "orted.c":525
> > #3  0x00002aaaab7a6e54 in __libc_start_main(...) in /lib64/libc-2.4.so
> > #4  0x0000000000401c19 in _start(...) in /osr/users/rountree/ompi-1.2.4_intel_threaded_debug/bin/orted
> > ============
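
That orted trace is the classic signature of a condition-variable wait that
never gets signaled.  Stripped of the OPAL wrappers, frame #1 boils down to
the standard pthread pattern below (illustrative only -- this is not the
actual orted code); if no other thread ever flips the flag and signals the
condition, the wait blocks forever:

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static bool done = false;

    /* Blocks until some other thread sets `done` and signals `cond`. */
    void wait_for_shutdown(void)
    {
        pthread_mutex_lock(&lock);
        while (!done)                        /* guard against spurious wakeups */
            pthread_cond_wait(&cond, &lock); /* atomically unlocks and sleeps */
        pthread_mutex_unlock(&lock);
    }
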
> >
> > The stack trace of the mpirun process on the root node isn't quite as useful.
> >
> >
> > ============
> > Stopping process localhost:29856 (/osr/users/rountree/ompi-1.2.4_intel_threaded_debug/bin/orterun).
> > Thread received signal INT
> > stopped at [<opaque> poll(...) 0x00000039ef2c3806]
> > (idb) where
> > >0  0x00000039ef2c3806 in poll(...) in /lib64/libc-2.4.so
> > #1  0x0000000040a000c0
> > ============
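
And the mpirun side is simply the event loop parked in poll(), waiting for
traffic from the orteds.  Again illustrative rather than the actual code,
the shape of it is roughly:

    #include <poll.h>

    /* Wait indefinitely for input on fd.  With a -1 timeout and no
     * incoming events, poll() never returns -- consistent with mpirun
     * sitting here while the job hangs. */
    int wait_for_io(int fd)
    {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        return poll(&pfd, 1, -1);   /* -1 => block until an event arrives */
    }

So both sides look healthy but idle: each appears to be waiting for an event
the other never delivers.
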
> >
> > Let me know what other information would be helpful.
> >
> > Best,
> >
> > Barry
>



-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmat...@gmail.com || timat...@open-mpi.org
    I'm a bright... http://www.the-brights.net/
