Yeah, the system doesn't currently support --enable-progress-threads. It is a 
two-fold problem: ORTE won't work that way, and some parts of the MPI layer 
won't either.
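
(If you want to double-check what a given build actually reports, ompi_info 
shows the thread settings; the exact wording varies by version, but something 
along the lines of

  ompi_info | grep -i thread

will tell you whether MPI threads and progress threads were enabled at 
configure time.)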

I am currently working on fixing ORTE so it will work with progress threads 
enabled. I believe (but can't confirm) that the TCP BTL will also work with 
that feature, but I have heard that the other BTLs won't (again, can't 
confirm).
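
If you want to experiment once ORTE is fixed, one option is to restrict the 
job to the TCP BTL so the unconfirmed BTLs never get selected. A rough sketch 
(the process count and executable here are just placeholders):

  mpirun --mca btl tcp,self -np 2 ./your_app

The "self" BTL has to stay in the list so a process can still send to itself.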

I'll send out a note when ORTE is okay, but that won't be included in a release 
for a while.
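
In the meantime, the usual workaround is just to rebuild without that flag, 
i.e. drop --enable-progress-threads from the configure line you show below:

  ./configure --with-openib=/usr --enable-trace --enable-debug --enable-peruse

and then rebuild/reinstall as usual.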

On Jan 8, 2010, at 9:38 AM, Dong Li wrote:

> Hi, guys.
> My application got stuck when I ran it with Open MPI 1.4 with
> progress threads enabled.
> 
> The OpenMPI is configured and compiled with the following options.
> ./configure --with-openib=/usr --enable-trace --enable-debug
> --enable-peruse --enable-progress-threads
> 
> Then I started the application with two MPI processes, but it looks
> like there is some problem with ORTE, and mpiexec just got stuck there
> and never ran the application.
> I used gdb to attach to mpiexec to find out where the program got
> stuck. The backtrace information is shown below for the two
> MPI processes (i.e. rank 0 and rank 1). It looks to me like the
> problem happens in rank 0 when it tries to do an atomic add
> operation. Note that my processor is an Intel Xeon CPU E5462, but
> Open MPI tried to use AMD64 instructions for the atomic add
> operations. Is this a bug or something?
> 
> Any comment? Thank you.
> 
> -Dong
> 
> 
> ***********************************************************************************************************************************************
> The following is for the rank 0.
> (gdb) bt
> #0  0x00007fbdd1c93264 in opal_atomic_cmpset_32 (addr=0x7fbdd1eede24,
> oldval=1, newval=0) at ../opal/include/opal/sys/amd64/atomic.h:94
> #1  0x00007fbdd1c93348 in opal_atomic_add_xx (addr=0x7fbdd1eede24,
> value=1, length=4) at ../opal/include/opal/sys/atomic_impl.h:243
> #2  0x00007fbdd1c932ad in opal_progress () at runtime/opal_progress.c:171
> #3  0x00007fbdd1f5c9ad in orte_plm_base_daemon_callback
> (num_daemons=1) at base/plm_base_launch_support.c:459
> #4  0x00007fbdd0a5579d in orte_plm_rsh_launch (jdata=0x60f070) at
> plm_rsh_module.c:1221
> #5  0x0000000000403821 in orterun (argc=15, argv=0x7fffda18a498) at
> orterun.c:748
> #6  0x0000000000402dc7 in main (argc=15, argv=0x7fffda18a498) at main.c:13
> ************************************************************************************************************************************************
> The following is for the rank 1.
> #0  0x0000003c4c20b309 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1  0x00007f6f8b04ba56 in opal_condition_wait (c=0x656ce0, m=0x656c88)
> at ../../../../opal/threads/condition.h:78
> #2  0x00007f6f8b04b8b7 in orte_rml_oob_send (peer=0x7f6f8c578978,
> iov=0x7fff945798d0, count=1, tag=10, flags=16) at rml_oob_send.c:153
> #3  0x00007f6f8b04c197 in orte_rml_oob_send_buffer
> (peer=0x7f6f8c578978, buffer=0x6563b0, tag=10, flags=0) at
> rml_oob_send.c:269
> #4  0x00007f6f8c32fe24 in orte_daemon (argc=28, argv=0x7fff9457abd8)
> at orted/orted_main.c:610
> #5  0x0000000000400917 in main (argc=28, argv=0x7fff9457abd8) at orted.c:62

