Yeah, the system doesn't currently support --enable-progress-threads. It is a two-fold problem: ORTE won't work that way, and some parts of the MPI layer won't either.
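In the meantime, the only workaround I can suggest is to rebuild without that flag. A minimal sketch, reusing the configure options from your report below but dropping --enable-progress-threads:

    # reconfigure the same 1.4 source tree without asynchronous progress threads
    ./configure --with-openib=/usr --enable-trace --enable-debug --enable-peruse
    make all install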
I am currently working on fixing ORTE so it will work with progress threads enabled. I believe (but can't confirm) that the TCP BTL will also work with that feature, but I have heard that the other BTLs won't (again, can't confirm). I'll send out a note when ORTE is okay, but that won't be included in a release for a while.

On Jan 8, 2010, at 9:38 AM, Dong Li wrote:

> Hi, guys.
> My application gets stuck when I run it with Open MPI 1.4 with
> progress threads enabled.
>
> Open MPI is configured and compiled with the following options:
> ./configure --with-openib=/usr --enable-trace --enable-debug
> --enable-peruse --enable-progress-threads
>
> Then I started the application with two MPI processes, but it looks
> like there is some problem with ORTE: the mpiexec just gets stuck
> and never runs the application.
> I used gdb to attach to the mpiexec to find out where the program got
> stuck. The backtrace information is shown below for the two MPI
> processes (i.e. rank 0 and rank 1). It looks to me like the problem
> happens in rank 0 when it tries to do some atomic add operation.
> Note that my processor is an Intel Xeon CPU E5462, but Open MPI tried
> to use some AMD64 instructions to perform the atomic add operations.
> Is this a bug or something?
>
> Any comment? Thank you.
>
> -Dong
>
>
> **********************************************************************
> The following is for rank 0.
> (gdb) bt
> #0  0x00007fbdd1c93264 in opal_atomic_cmpset_32 (addr=0x7fbdd1eede24,
>     oldval=1, newval=0) at ../opal/include/opal/sys/amd64/atomic.h:94
> #1  0x00007fbdd1c93348 in opal_atomic_add_xx (addr=0x7fbdd1eede24,
>     value=1, length=4) at ../opal/include/opal/sys/atomic_impl.h:243
> #2  0x00007fbdd1c932ad in opal_progress () at runtime/opal_progress.c:171
> #3  0x00007fbdd1f5c9ad in orte_plm_base_daemon_callback
>     (num_daemons=1) at base/plm_base_launch_support.c:459
> #4  0x00007fbdd0a5579d in orte_plm_rsh_launch (jdata=0x60f070) at
>     plm_rsh_module.c:1221
> #5  0x0000000000403821 in orterun (argc=15, argv=0x7fffda18a498) at
>     orterun.c:748
> #6  0x0000000000402dc7 in main (argc=15, argv=0x7fffda18a498) at main.c:13
> **********************************************************************
> The following is for rank 1.
> #0  0x0000003c4c20b309 in pthread_cond_wait@@GLIBC_2.3.2 () from
>     /lib64/libpthread.so.0
> #1  0x00007f6f8b04ba56 in opal_condition_wait (c=0x656ce0, m=0x656c88)
>     at ../../../../opal/threads/condition.h:78
> #2  0x00007f6f8b04b8b7 in orte_rml_oob_send (peer=0x7f6f8c578978,
>     iov=0x7fff945798d0, count=1, tag=10, flags=16) at rml_oob_send.c:153
> #3  0x00007f6f8b04c197 in orte_rml_oob_send_buffer
>     (peer=0x7f6f8c578978, buffer=0x6563b0, tag=10, flags=0) at
>     rml_oob_send.c:269
> #4  0x00007f6f8c32fe24 in orte_daemon (argc=28, argv=0x7fff9457abd8)
>     at orted/orted_main.c:610
> #5  0x0000000000400917 in main (argc=28, argv=0x7fff9457abd8) at orted.c:62