I'm having trouble with an application (CosmoMC;
<http://cosmologist.info/cosmomc/>) that can use both OpenMPI and
OpenMP.

I have several Opteron boxes, each with two dual-core CPUs (four cores
per box). I want to run the application with 4 MPI processes (one per
box), each of which in turn splits into 4 OpenMP threads (one per core).
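
Spelled out, using the four hosts that show up in the debug output at
the end of this message, the intended layout is:

    coma003  ->  1 MPI process  ->  4 OpenMP threads
    coma004  ->  1 MPI process  ->  4 OpenMP threads
    coma005  ->  1 MPI process  ->  4 OpenMP threads
    coma006  ->  1 MPI process  ->  4 OpenMP threads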

The code is Fortran 90, and the compiler is the Intel Fortran Compiler
Version 8.1. OpenMPI v1.0.1 works fine (communicating between boxes or
amongst the CPUs in a single box) without OpenMP, and OpenMP works
fine without OpenMPI.

The combination of OpenMP + OpenMPI works fine if I restrict the
application to only 1 OpenMP thread per MPI process (in other words,
the code at least compiles and runs fine with both options enabled, in
this limited sense). If I try to use my desired value of 4 OpenMP
threads, it crashes. It works fine, however, if I use MPICH as the MPI
implementation.
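
In case it clarifies what I mean by "the combination": as far as I can
tell the code does nothing exotic, just MPI_Init at startup and then
OpenMP parallel regions within each rank. A stripped-down test with the
same shape (not CosmoMC itself, just an illustration) would look
something like:

    program hybrid_test
      ! Minimal hybrid MPI + OpenMP test: each MPI rank opens an OpenMP
      ! parallel region and reports its (rank, thread) pair.
      use omp_lib
      implicit none
      include 'mpif.h'
      integer :: ierr, rank, nprocs, tid, nthreads

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      !$omp parallel private(tid, nthreads)
      tid      = omp_get_thread_num()
      nthreads = omp_get_num_threads()
      write(*,*) 'rank', rank, 'of', nprocs, ': thread', tid, 'of', nthreads
      !$omp end parallel

      call MPI_Barrier(MPI_COMM_WORLD, ierr)
      call MPI_Finalize(ierr)
    end program hybrid_test

If it helps, I can run a test like this against both OpenMPI and MPICH
and report what I see.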

The hostfile specifies "slots=4 max-slots=4" for each host (lying and
saying "slots=1" did not help), and I use "-np 4 --bynode" to get only
one MPI process per host. I'm using ssh over Gbit ethernet between
hosts.
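
For reference, the hostfile and launch line look roughly like this
(modulo exact paths and exactly how the hostfile gets passed to
mpirun):

    # hostfile ("hosts"): one line per Opteron box
    coma003 slots=4 max-slots=4
    coma004 slots=4 max-slots=4
    coma005 slots=4 max-slots=4
    coma006 slots=4 max-slots=4

    # one MPI process per box, each then spawning 4 OpenMP threads
    mpirun -np 4 --bynode --hostfile hosts ./cosmomc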

There is no error message that tells me anything useful. Watching top,
I can see that processes are spawned on the four hosts, split into 4
OpenMP threads each, and then crash immediately. All mpirun reports is:

    mpirun noticed that job rank 0 with PID 30243 on node "coma006" exited on signal 11.
    Broken pipe


Using mpirun -d reveals nothing useful to me (see end of message).


I realize this is all rather vague. Any advice or tips for debugging
(or OpenMPI + OpenMP success stories!) would be appreciated.


TIA.


[coma006:30450] Info: Setting up debugger process table for applications
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 4
  MPIR_proctable:
    (i, host, exe, pid) = (0, coma003, ./cosmomc, 20847)
    (i, host, exe, pid) = (1, coma004, ./cosmomc, 21622)
    (i, host, exe, pid) = (2, coma005, ./cosmomc, 22080)
    (i, host, exe, pid) = (3, coma006, ./cosmomc, 30461)
[coma006:30450] spawn: in job_state_callback(jobid = 1, state = 0x4)
[coma006:30461] [0,1,0] ompi_mpi_init completed
[coma004:21622] [0,1,2] ompi_mpi_init completed
[coma005:22080] [0,1,1] ompi_mpi_init completed
[coma003:20847] [0,1,3] ompi_mpi_init completed
<snip application output>
[coma005:22079] sess_dir_finalize: found proc session dir empty - deleting
[coma005:22079] sess_dir_finalize: found job session dir empty - deleting
[coma005:22079] sess_dir_finalize: univ session dir not empty - leaving
[coma006:30450] spawn: in job_state_callback(jobid = 1, state = 0xa)
[coma006:30451] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_ABORTED)
[coma005:22079] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_ABORTED)
[coma004:21621] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_ABORTED)
mpirun noticed that job rank 1 with PID 22080 on node "coma005" exited on signal 11.
[coma003:20846] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_ABORTED)
[coma005:22079] sess_dir_finalize: found proc session dir empty - deleting
[coma005:22079] sess_dir_finalize: found job session dir empty - deleting
[coma005:22079] sess_dir_finalize: found univ session dir empty - deleting
[coma005:22079] sess_dir_finalize: top session dir not empty - leaving
<repeated for other hosts>
3 processes killed (possibly by Open MPI)
[coma006:30451] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_TERMINATED)
[coma006:30451] sess_dir_finalize: found proc session dir empty - deleting
[coma006:30451] sess_dir_finalize: job session dir not empty - leaving
