I'm having trouble with an application (CosmoMC; <http://cosmologist.info/cosmomc/>) that can use both OpenMPI and OpenMP.
I have several Opteron boxes, each with 2 dual-core CPUs. I want to run the application with 4 MPI processes (one per box), each of which in turn splits into 4 OpenMP threads (one per core). The code is Fortran 90, and the compiler is the Intel Fortran Compiler version 8.1.

OpenMPI v1.0.1 works fine without OpenMP (communicating between boxes or amongst the CPUs in a single box), and OpenMP works fine without OpenMPI. The combination OpenMP + OpenMPI also works if I restrict the application to only 1 OpenMP thread per MPI process, so the code at least compiles and runs with both options enabled, in this limited sense. If I try to use my desired value of 4 OpenMP threads, it crashes. It works fine, however, if I use MPICH as the MPI implementation.

The hostfile specifies "slots=4 max-slots=4" for each host (trying to lie and say "slots=1" did not help), and I use "-np 4 --bynode" to get only one MPI process per host. I'm using ssh over Gbit ethernet between hosts.

There is no useful error message that I can see. Watching top, I can see that processes are spawned on the four hosts, split into 4 OpenMP threads, and then crash immediately. The only error message is:

mpirun noticed that job rank 0 with PID 30243 on node "coma006" exited on signal 11.
Broken pipe

Running mpirun -d reveals nothing useful to me (see end of message). I realize this is all rather vague. Any advice, or tips for debugging (or OpenMPI + OpenMP success stories!), would be appreciated. TIA.
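In case it helps, here is roughly how I set things up (hostnames are the real ones from the log below; I'm assuming here that showing the OMP_NUM_THREADS export makes the thread count explicit — that's how I set it):

```shell
# hostfile contents (one line per Opteron box):
#   coma003 slots=4 max-slots=4
#   coma004 slots=4 max-slots=4
#   coma005 slots=4 max-slots=4
#   coma006 slots=4 max-slots=4

# 4 OpenMP threads per MPI process (this is the value that triggers the crash;
# setting it to 1 runs fine)
export OMP_NUM_THREADS=4

# one MPI process per host, 4 processes total, launched over ssh
mpirun -np 4 --bynode --hostfile hostfile ./cosmomc
```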
[coma006:30450] Info: Setting up debugger process table for applications
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 4
  MPIR_proctable:
    (i, host, exe, pid) = (0, coma003, ./cosmomc, 20847)
    (i, host, exe, pid) = (1, coma004, ./cosmomc, 21622)
    (i, host, exe, pid) = (2, coma005, ./cosmomc, 22080)
    (i, host, exe, pid) = (3, coma006, ./cosmomc, 30461)
[coma006:30450] spawn: in job_state_callback(jobid = 1, state = 0x4)
[coma006:30461] [0,1,0] ompi_mpi_init completed
[coma004:21622] [0,1,2] ompi_mpi_init completed
[coma005:22080] [0,1,1] ompi_mpi_init completed
[coma003:20847] [0,1,3] ompi_mpi_init completed
<snip application output>
[coma005:22079] sess_dir_finalize: found proc session dir empty - deleting
[coma005:22079] sess_dir_finalize: found job session dir empty - deleting
[coma005:22079] sess_dir_finalize: univ session dir not empty - leaving
[coma006:30450] spawn: in job_state_callback(jobid = 1, state = 0xa)
[coma006:30451] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_ABORTED)
[coma005:22079] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_ABORTED)
[coma004:21621] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_ABORTED)
mpirun noticed that job rank 1 with PID 22080 on node "coma005" exited on signal 11.
[coma003:20846] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_ABORTED)
[coma005:22079] sess_dir_finalize: found proc session dir empty - deleting
[coma005:22079] sess_dir_finalize: found job session dir empty - deleting
[coma005:22079] sess_dir_finalize: found univ session dir empty - deleting
[coma005:22079] sess_dir_finalize: top session dir not empty - leaving
<repeated for other hosts>
3 processes killed (possibly by Open MPI)
[coma006:30451] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_TERMINATED)
[coma006:30451] sess_dir_finalize: found proc session dir empty - deleting
[coma006:30451] sess_dir_finalize: job session dir not empty - leaving