Hi,
I'm getting the same hangs in my environment and will contribute my findings. The output from ompi_info, the debug output, and the application source are attached.
When running two processes on one machine, like this:

    mpirun -np 2 -mca orte_debug 1 mpitest

the application executes and terminates successfully. When running two processes on two machines, like this:

    mpirun -hostfile nodelist -np 2 -mca orte_debug 1 mpitest

the application hangs. Both the mpitest application and orted appear in the process list on both machines, so they have at least been started. I have also tried a nodelist containing only the local host, and that works.
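For readers without the attachment: the test programs discussed in this thread just initialize MPI, print a string, and exit. A minimal sketch of that shape (an illustration, not the exact attached source) is:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }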
-Arnstein

Jeff Squyres wrote:
> Hugh --
>
> We are actually unable to replicate the problem; we've run some single-threaded and multi-threaded apps with no problems. This is unfortunately probably symptomatic of bugs that are still remaining in the code. :-(
>
> Can you try disabling MPI progress threads (I believe that tcp may be the only BTL component that has async progress support implemented anyway; sm *may*, but I'd have to go back and check)? Leave MPI threads enabled (i.e., MPI_THREAD_MULTIPLE) and see if that gets you further.
>
> Hugh Merz wrote:
>
>>> It's still only lightly tested. I'm surprised that it totally hangs for you, though -- what is your simple test program doing?
>>
>> It just initializes mpi (tried both mpi_init and mpi_init_thread), prints a string and exits. It works fine without thread support compiled into ompi. It happens with any mpi program I try.
>>
>> Attaching gdb to each thread of the executable gives:
>>
>> (original process)
>> #0 0x420293d5 in sigsuspend () from /lib/i686/libc.so.6
>> #1 0x401e8609 in __pthread_wait_for_restart_signal () from /lib/i686/libpthread.so.0
>> #2 0x401e4eec in pthread_cond_wait () from /lib/i686/libpthread.so.0
>> #3 0x40bda418 in mca_oob_tcp_msg_wait () from /opt/openmpi-1.0rc2_asynch/lib/openmpi/mca_oob_tcp.so
>>
>> (thread 1)
>> #0 0x420e01a7 in poll () from /lib/i686/libc.so.6
>> #1 0x401e5c30 in __pthread_manager () from /lib/i686/libpthread.so.0
>>
>> (thread 2)
>> #0 0x420e01a7 in poll () from /lib/i686/libc.so.6
>> #1 0x4013268b in poll_dispatch () from /opt/openmpi-1.0rc2_asynch/lib/libopal.so.0
>> Cannot access memory at address 0x3e8
>>
>> (thread 3)
>> #0 0x420dae14 in read () from /lib/i686/libc.so.6
>> #1 0x401f3b18 in __DTOR_END__ () from /lib/i686/libpthread.so.0
>> #2 0x40c8dfe3 in mca_btl_sm_component_event_thread () from /opt/openmpi-1.0rc2_asynch/lib/openmpi/mca_btl_sm.so
>>
>> And there are also 2 additional threads spawned by each of mpirun and orted.
>>
>> Any clues or hints on how to debug this would be appreciated, but I understand that it is probably not high priority right now.
>>
>> Thanks,
>> Hugh
>>
>> Hugh Merz wrote:
>>
>>> Howdy,
>>>
>>> I tried installing the release candidate with thread support enabled (--enable-mpi-threads and --enable-progress-threads) using an old rh7.3 install and a recent fc4 install (Intel compilers). When I try to run a simple test program, the executable, mpirun, and orted all sleep in what appears to be a deadlock. If I compile ompi without threads, everything works fine.
>>>
>>> The FAQ states that thread support has only been lightly tested, and there was only brief discussion about it on the mailing list 8 months ago -- have there been any developments, and should I expect it to work properly?
>>>
>>> Thanks,
>>> Hugh
>
> --
> {+} Jeff Squyres
> {+} The Open MPI Project
> {+} http://www.open-mpi.org/
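Following up on Jeff's suggestion above: progress threads are a configure-time option, so the experiment amounts to rebuilding with MPI threads only, i.e. leaving out --enable-progress-threads (the install prefix here is just a placeholder, not the exact build path used above):

    ./configure --prefix=/opt/openmpi-1.0rc2-mt --enable-mpi-threads
    make all install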
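When re-testing, note that MPI_Init_thread is allowed to grant a lower thread level than requested, so it is worth checking the provided value rather than assuming MPI_THREAD_MULTIPLE was granted. A minimal check:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int provided;

        /* Request full thread support; the library may grant less. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE)
            printf("warning: only thread level %d provided\n", provided);
        MPI_Finalize();
        return 0;
    }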
Attachment: files.tgz (application/compressed-tar)