users-requ...@open-mpi.org wrote:
A few clarifying questions:

What is your netmask on these hosts?

Where is the MPI_ALLREDUCE in your app -- right away, or somewhere deep
within the application?  Can you replicate this with a simple MPI
application that essentially calls MPI_INIT, MPI_ALLREDUCE, and
MPI_FINALIZE?

Can you replicate this with a simple MPI app that does an MPI_SEND /
MPI_RECV between two processes on the different subnets?
Thanks.


@ Jeff,

netmask 255.255.255.0

Running a simple "hello world" within either subnet alone yields no error, but running it across both subnets yields this error:

[g5dual.3-net:00436] *** An error occurred in MPI_Send
[g5dual.3-net:00436] *** on communicator MPI_COMM_WORLD
[g5dual.3-net:00436] *** MPI_ERR_INTERN: internal error
[g5dual.3-net:00436] *** MPI_ERRORS_ARE_FATAL (goodbye)

Hope this helps!

Frank


Just in case you wanna check the source:
c     Fortran example hello_world
      program hello
      include 'mpif.h'
      integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)
      character*12 message

      call MPI_INIT(ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      tag = 100

      if (rank .eq. 0) then
        message = 'Hello, world'
        do i=1, size-1
          call MPI_SEND(message, 12, MPI_CHARACTER, i, tag,
     &                  MPI_COMM_WORLD, ierror)
        enddo

      else
        call MPI_RECV(message, 12, MPI_CHARACTER, 0, tag,
     &                MPI_COMM_WORLD, status, ierror)
      endif

      print*, 'node', rank, ':', message
      call MPI_FINALIZE(ierror)
      end
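
For the two-process MPI_SEND / MPI_RECV test asked about above, a minimal sketch could look like the following. It is only an illustration in the same fixed-form Fortran style: the program name sendrecv, the tag 200, and the payload value 42 are assumptions chosen for the example, not taken from the real application.

```fortran
c     Two-process MPI_SEND / MPI_RECV test (illustrative sketch)
      program sendrecv
      include 'mpif.h'
      integer rank, size, ierror, tag, value
      integer status(MPI_STATUS_SIZE)

      call MPI_INIT(ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
c     tag and payload are arbitrary values picked for this sketch
      tag = 200

      if (size .ne. 2) then
c       only the exactly-two-process case is exercised here
        if (rank .eq. 0) print*, 'run with exactly 2 processes'
      else if (rank .eq. 0) then
c       rank 0 sends a single integer to rank 1
        value = 42
        call MPI_SEND(value, 1, MPI_INTEGER, 1, tag,
     &                MPI_COMM_WORLD, ierror)
        print*, 'rank 0 sent', value
      else
c       rank 1 receives the integer from rank 0
        call MPI_RECV(value, 1, MPI_INTEGER, 0, tag,
     &                MPI_COMM_WORLD, status, ierror)
        print*, 'rank 1 received', value
      endif

      call MPI_FINALIZE(ierror)
      end
```

Launched with -np 2 and a hostfile that lists one host from each subnet, this would exercise exactly one message across the subnet boundary, which should make it easier to tell whether the failure depends on the message pattern or on the routing between the two networks.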


or the full output:

[powerbook:/Network/CFD/hello] motte% mpirun -d -np 5 --hostfile ./hostfile /Network/CFD/hello/hello_world
[powerbook.2-net:00606] [0,0,0] setting up session dir with
[powerbook.2-net:00606]         universe default-universe
[powerbook.2-net:00606]         user motte
[powerbook.2-net:00606]         host powerbook.2-net
[powerbook.2-net:00606]         jobid 0
[powerbook.2-net:00606]         procid 0
[powerbook.2-net:00606] procdir: /tmp/openmpi-sessions-motte@powerbook.2-net_0/default-universe/0/0
[powerbook.2-net:00606] jobdir: /tmp/openmpi-sessions-motte@powerbook.2-net_0/default-universe/0
[powerbook.2-net:00606] unidir: /tmp/openmpi-sessions-motte@powerbook.2-net_0/default-universe
[powerbook.2-net:00606] top: openmpi-sessions-motte@powerbook.2-net_0
[powerbook.2-net:00606] tmp: /tmp
[powerbook.2-net:00606] [0,0,0] contact_file /tmp/openmpi-sessions-motte@powerbook.2-net_0/default-universe/universe-setup.txt
[powerbook.2-net:00606] [0,0,0] wrote setup file
[powerbook.2-net:00606] pls:rsh: local csh: 1, local bash: 0
[powerbook.2-net:00606] pls:rsh: assuming same remote shell as local shell
[powerbook.2-net:00606] pls:rsh: remote csh: 1, remote bash: 0
[powerbook.2-net:00606] pls:rsh: final template argv:
[powerbook.2-net:00606] pls:rsh: /usr/bin/ssh <template> orted --debug --bootproxy 1 --name <template> --num_procs 6 --vpid_start 0 --nodename <template> --universe motte@powerbook.2-net:default-universe --nsreplica "0.0.0;tcp://192.168.2.3:49443" --gprreplica "0.0.0;tcp://192.168.2.3:49443" --mpi-call-yield 0
[powerbook.2-net:00606] pls:rsh: launching on node Powerbook.2-net
[powerbook.2-net:00606] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[powerbook.2-net:00606] pls:rsh: Powerbook.2-net is a LOCAL node
[powerbook.2-net:00606] pls:rsh: changing to directory /Users/motte
[powerbook.2-net:00606] pls:rsh: executing: orted --debug --bootproxy 1 --name 0.0.1 --num_procs 6 --vpid_start 0 --nodename Powerbook.2-net --universe motte@powerbook.2-net:default-universe --nsreplica "0.0.0;tcp://192.168.2.3:49443" --gprreplica "0.0.0;tcp://192.168.2.3:49443" --mpi-call-yield 0
[powerbook.2-net:00607] [0,0,1] setting up session dir with
[powerbook.2-net:00607]         universe default-universe
[powerbook.2-net:00607]         user motte
[powerbook.2-net:00607]         host Powerbook.2-net
[powerbook.2-net:00607]         jobid 0
[powerbook.2-net:00607]         procid 1
[powerbook.2-net:00607] procdir: /tmp/openmpi-sessions-motte@Powerbook.2-net_0/default-universe/0/1
[powerbook.2-net:00607] jobdir: /tmp/openmpi-sessions-motte@Powerbook.2-net_0/default-universe/0
[powerbook.2-net:00607] unidir: /tmp/openmpi-sessions-motte@Powerbook.2-net_0/default-universe
[powerbook.2-net:00607] top: openmpi-sessions-motte@Powerbook.2-net_0
[powerbook.2-net:00607] tmp: /tmp
[powerbook.2-net:00606] pls:rsh: launching on node g4d003.3-net
[powerbook.2-net:00606] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[powerbook.2-net:00606] pls:rsh: g4d003.3-net is a REMOTE node
[powerbook.2-net:00606] pls:rsh: executing: /usr/bin/ssh g4d003.3-net orted --debug --bootproxy 1 --name 0.0.2 --num_procs 6 --vpid_start 0 --nodename g4d003.3-net --universe motte@powerbook.2-net:default-universe --nsreplica "0.0.0;tcp://192.168.2.3:49443" --gprreplica "0.0.0;tcp://192.168.2.3:49443" --mpi-call-yield 0
[g4d003.3-net:00411] [0,0,2] setting up session dir with
[g4d003.3-net:00411]    universe default-universe
[g4d003.3-net:00411]    user motte
[g4d003.3-net:00411]    host g4d003.3-net
[g4d003.3-net:00411]    jobid 0
[g4d003.3-net:00411]    procid 2
[g4d003.3-net:00411] procdir: /tmp/openmpi-sessions-motte@g4d003.3-net_0/default-universe/0/2
[g4d003.3-net:00411] jobdir: /tmp/openmpi-sessions-motte@g4d003.3-net_0/default-universe/0
[g4d003.3-net:00411] unidir: /tmp/openmpi-sessions-motte@g4d003.3-net_0/default-universe
[g4d003.3-net:00411] top: openmpi-sessions-motte@g4d003.3-net_0
[g4d003.3-net:00411] tmp: /tmp
[powerbook.2-net:00606] pls:rsh: launching on node g4d002.3-net
[powerbook.2-net:00606] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[powerbook.2-net:00606] pls:rsh: g4d002.3-net is a REMOTE node
[powerbook.2-net:00606] pls:rsh: executing: /usr/bin/ssh g4d002.3-net orted --debug --bootproxy 1 --name 0.0.3 --num_procs 6 --vpid_start 0 --nodename g4d002.3-net --universe motte@powerbook.2-net:default-universe --nsreplica "0.0.0;tcp://192.168.2.3:49443" --gprreplica "0.0.0;tcp://192.168.2.3:49443" --mpi-call-yield 0
[powerbook.2-net:00606] pls:rsh: launching on node g4d001.3-net
[powerbook.2-net:00606] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[powerbook.2-net:00606] pls:rsh: g4d001.3-net is a REMOTE node
[powerbook.2-net:00606] pls:rsh: executing: /usr/bin/ssh g4d001.3-net orted --debug --bootproxy 1 --name 0.0.4 --num_procs 6 --vpid_start 0 --nodename g4d001.3-net --universe motte@powerbook.2-net:default-universe --nsreplica "0.0.0;tcp://192.168.2.3:49443" --gprreplica "0.0.0;tcp://192.168.2.3:49443" --mpi-call-yield 0
[powerbook.2-net:00606] pls:rsh: launching on node G5Dual.3-net
[powerbook.2-net:00606] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[powerbook.2-net:00606] pls:rsh: G5Dual.3-net is a REMOTE node
[powerbook.2-net:00606] pls:rsh: executing: /usr/bin/ssh G5Dual.3-net orted --debug --bootproxy 1 --name 0.0.5 --num_procs 6 --vpid_start 0 --nodename G5Dual.3-net --universe motte@powerbook.2-net:default-universe --nsreplica "0.0.0;tcp://192.168.2.3:49443" --gprreplica "0.0.0;tcp://192.168.2.3:49443" --mpi-call-yield 0
[g4d001.3-net:00336] [0,0,4] setting up session dir with
[g4d001.3-net:00336]    universe default-universe
[g4d001.3-net:00336]    user motte
[g4d001.3-net:00336]    host g4d001.3-net
[g4d001.3-net:00336]    jobid 0
[g4d001.3-net:00336]    procid 4
[g4d001.3-net:00336] procdir: /tmp/openmpi-sessions-motte@g4d001.3-net_0/default-universe/0/4
[g4d001.3-net:00336] jobdir: /tmp/openmpi-sessions-motte@g4d001.3-net_0/default-universe/0
[g4d001.3-net:00336] unidir: /tmp/openmpi-sessions-motte@g4d001.3-net_0/default-universe
[g4d001.3-net:00336] top: openmpi-sessions-motte@g4d001.3-net_0
[g4d001.3-net:00336] tmp: /tmp
[g4d002.3-net:00279] [0,0,3] setting up session dir with
[g4d002.3-net:00279]    universe default-universe
[g4d002.3-net:00279]    user motte
[g4d002.3-net:00279]    host g4d002.3-net
[g4d002.3-net:00279]    jobid 0
[g4d002.3-net:00279]    procid 3
[g4d002.3-net:00279] procdir: /tmp/openmpi-sessions-motte@g4d002.3-net_0/default-universe/0/3
[g4d002.3-net:00279] jobdir: /tmp/openmpi-sessions-motte@g4d002.3-net_0/default-universe/0
[g4d002.3-net:00279] unidir: /tmp/openmpi-sessions-motte@g4d002.3-net_0/default-universe
[g4d002.3-net:00279] top: openmpi-sessions-motte@g4d002.3-net_0
[g4d002.3-net:00279] tmp: /tmp
[g5dual.3-net:00434] [0,0,5] setting up session dir with
[g5dual.3-net:00434]    universe default-universe
[g5dual.3-net:00434]    user motte
[g5dual.3-net:00434]    host G5Dual.3-net
[g5dual.3-net:00434]    jobid 0
[g5dual.3-net:00434]    procid 5
[g5dual.3-net:00434] procdir: /tmp/openmpi-sessions-motte@G5Dual.3-net_0/default-universe/0/5
[g5dual.3-net:00434] jobdir: /tmp/openmpi-sessions-motte@G5Dual.3-net_0/default-universe/0
[g5dual.3-net:00434] unidir: /tmp/openmpi-sessions-motte@G5Dual.3-net_0/default-universe
[g5dual.3-net:00434] top: openmpi-sessions-motte@G5Dual.3-net_0
[g5dual.3-net:00434] tmp: /tmp
[powerbook.2-net:00613] [0,1,4] setting up session dir with
[powerbook.2-net:00613]         universe default-universe
[powerbook.2-net:00613]         user motte
[powerbook.2-net:00613]         host Powerbook.2-net
[powerbook.2-net:00613]         jobid 1
[powerbook.2-net:00613]         procid 4
[powerbook.2-net:00613] procdir: /tmp/openmpi-sessions-motte@Powerbook.2-net_0/default-universe/1/4
[powerbook.2-net:00613] jobdir: /tmp/openmpi-sessions-motte@Powerbook.2-net_0/default-universe/1
[powerbook.2-net:00613] unidir: /tmp/openmpi-sessions-motte@Powerbook.2-net_0/default-universe
[powerbook.2-net:00613] top: openmpi-sessions-motte@Powerbook.2-net_0
[powerbook.2-net:00613] tmp: /tmp
[g5dual.3-net:00436] [0,1,0] setting up session dir with
[g5dual.3-net:00436]    universe default-universe
[g5dual.3-net:00436]    user motte
[g5dual.3-net:00436]    host G5Dual.3-net
[g5dual.3-net:00436]    jobid 1
[g5dual.3-net:00436]    procid 0
[g5dual.3-net:00436] procdir: /tmp/openmpi-sessions-motte@G5Dual.3-net_0/default-universe/1/0
[g5dual.3-net:00436] jobdir: /tmp/openmpi-sessions-motte@G5Dual.3-net_0/default-universe/1
[g5dual.3-net:00436] unidir: /tmp/openmpi-sessions-motte@G5Dual.3-net_0/default-universe
[g5dual.3-net:00436] top: openmpi-sessions-motte@G5Dual.3-net_0
[g5dual.3-net:00436] tmp: /tmp
[g4d001.3-net:00338] [0,1,1] setting up session dir with
[g4d001.3-net:00338]    universe default-universe
[g4d001.3-net:00338]    user motte
[g4d001.3-net:00338]    host g4d001.3-net
[g4d001.3-net:00338]    jobid 1
[g4d001.3-net:00338]    procid 1
[g4d001.3-net:00338] procdir: /tmp/openmpi-sessions-motte@g4d001.3-net_0/default-universe/1/1
[g4d001.3-net:00338] jobdir: /tmp/openmpi-sessions-motte@g4d001.3-net_0/default-universe/1
[g4d001.3-net:00338] unidir: /tmp/openmpi-sessions-motte@g4d001.3-net_0/default-universe
[g4d001.3-net:00338] top: openmpi-sessions-motte@g4d001.3-net_0
[g4d001.3-net:00338] tmp: /tmp
[g4d003.3-net:00413] [0,1,3] setting up session dir with
[g4d003.3-net:00413]    universe default-universe
[g4d003.3-net:00413]    user motte
[g4d003.3-net:00413]    host g4d003.3-net
[g4d003.3-net:00413]    jobid 1
[g4d003.3-net:00413]    procid 3
[g4d003.3-net:00413] procdir: /tmp/openmpi-sessions-motte@g4d003.3-net_0/default-universe/1/3
[g4d003.3-net:00413] jobdir: /tmp/openmpi-sessions-motte@g4d003.3-net_0/default-universe/1
[g4d003.3-net:00413] unidir: /tmp/openmpi-sessions-motte@g4d003.3-net_0/default-universe
[g4d003.3-net:00413] top: openmpi-sessions-motte@g4d003.3-net_0
[g4d003.3-net:00413] tmp: /tmp
[g4d002.3-net:00281] [0,1,2] setting up session dir with
[g4d002.3-net:00281]    universe default-universe
[g4d002.3-net:00281]    user motte
[g4d002.3-net:00281]    host g4d002.3-net
[g4d002.3-net:00281]    jobid 1
[g4d002.3-net:00281]    procid 2
[g4d002.3-net:00281] procdir: /tmp/openmpi-sessions-motte@g4d002.3-net_0/default-universe/1/2
[g4d002.3-net:00281] jobdir: /tmp/openmpi-sessions-motte@g4d002.3-net_0/default-universe/1
[g4d002.3-net:00281] unidir: /tmp/openmpi-sessions-motte@g4d002.3-net_0/default-universe
[g4d002.3-net:00281] top: openmpi-sessions-motte@g4d002.3-net_0
[g4d002.3-net:00281] tmp: /tmp
[powerbook.2-net:00606] spawn: in job_state_callback(jobid = 1, state = 0x4)
[powerbook.2-net:00606] Info: Setting up debugger process table for applications
 MPIR_being_debugged = 0
 MPIR_debug_gate = 0
 MPIR_debug_state = 1
 MPIR_acquired_pre_main = 0
 MPIR_i_am_starter = 0
 MPIR_proctable_size = 5
 MPIR_proctable:
(i, host, exe, pid) = (0, G5Dual.3-net, /Network/CFD/hello/hello_world, 436)
(i, host, exe, pid) = (1, g4d001.3-net, /Network/CFD/hello/hello_world, 338)
(i, host, exe, pid) = (2, g4d002.3-net, /Network/CFD/hello/hello_world, 281)
(i, host, exe, pid) = (3, g4d003.3-net, /Network/CFD/hello/hello_world, 413)
(i, host, exe, pid) = (4, Powerbook.2-net, /Network/CFD/hello/hello_world, 613)
[powerbook.2-net:00613] [0,1,4] ompi_mpi_init completed
[g4d001.3-net:00338] [0,1,1] ompi_mpi_init completed
[g5dual.3-net:00436] [0,1,0] ompi_mpi_init completed
[g4d003.3-net:00413] [0,1,3] ompi_mpi_init completed
[g4d002.3-net:00281] [0,1,2] ompi_mpi_init completed
node           1 :Hello, world
node           2 :Hello, world
node           3 :Hello, world
[g5dual.3-net:00436] *** An error occurred in MPI_Send

[g5dual.3-net:00436] *** on communicator MPI_COMM_WORLD
[g5dual.3-net:00436] *** MPI_ERR_INTERN: internal error
[g5dual.3-net:00436] *** MPI_ERRORS_ARE_FATAL (goodbye)
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: powerbook.2-net
PID:  613

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: g4d003.3-net
PID:  413

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: g5dual.3-net
PID:  436

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: g4d002.3-net
PID:  281

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: g4d001.3-net
PID:  338

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
[g5dual.3-net:00434] sess_dir_finalize: found proc session dir empty - deleting
[g5dual.3-net:00434] sess_dir_finalize: found job session dir empty - deleting
[g5dual.3-net:00434] sess_dir_finalize: univ session dir not empty - leaving
[powerbook.2-net:00607] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_ABORTED)
[g5dual.3-net:00434] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_ABORTED)
[g4d003.3-net:00411] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_ABORTED)
[g4d001.3-net:00336] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_ABORTED)
[g5dual.3-net:00434] sess_dir_finalize: job session dir not empty - leaving
[g5dual.3-net:00434] sess_dir_finalize: found proc session dir empty - deleting
[g5dual.3-net:00434] sess_dir_finalize: found job session dir empty - deleting
[g5dual.3-net:00434] sess_dir_finalize: found univ session dir empty - deleting
[g5dual.3-net:00434] sess_dir_finalize: found top session dir empty - deleting
[g4d002.3-net:00279] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_ABORTED)
[g4d002.3-net:00279] sess_dir_finalize: found job session dir empty - deleting
[g4d002.3-net:00279] sess_dir_finalize: univ session dir not empty - leaving
[g4d002.3-net:00279] sess_dir_finalize: proc session dir not empty - leaving
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: g4d002.3-net
PID:  281

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: g4d002.3-net
PID:  281

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
[g4d002.3-net:00279] sess_dir_finalize: found proc session dir empty - deleting
[g4d002.3-net:00279] sess_dir_finalize: found job session dir empty - deleting
[g4d002.3-net:00279] sess_dir_finalize: found univ session dir empty - deleting
[g4d002.3-net:00279] sess_dir_finalize: found top session dir empty - deleting
[powerbook.2-net:00607] sess_dir_finalize: found job session dir empty - deleting
[powerbook.2-net:00607] sess_dir_finalize: univ session dir not empty - leaving
[powerbook.2-net:00607] sess_dir_finalize: proc session dir not empty - leaving
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: powerbook.2-net
PID:  613

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: powerbook.2-net
PID:  613

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
[powerbook.2-net:00607] sess_dir_finalize: found proc session dir empty - deleting
[powerbook.2-net:00607] sess_dir_finalize: job session dir not empty - leaving
[g4d001.3-net:00336] sess_dir_finalize: found job session dir empty - deleting
[g4d001.3-net:00336] sess_dir_finalize: univ session dir not empty - leaving
[g4d001.3-net:00336] sess_dir_finalize: proc session dir not empty - leaving
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: g4d001.3-net
PID:  338

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: g4d001.3-net
PID:  338

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
[g4d001.3-net:00336] sess_dir_finalize: found proc session dir empty - deleting
[g4d001.3-net:00336] sess_dir_finalize: found job session dir empty - deleting
[g4d001.3-net:00336] sess_dir_finalize: found univ session dir empty - deleting
[g4d001.3-net:00336] sess_dir_finalize: found top session dir empty - deleting
[g4d003.3-net:00411] sess_dir_finalize: found job session dir empty - deleting
[g4d003.3-net:00411] sess_dir_finalize: univ session dir not empty - leaving
[g4d003.3-net:00411] sess_dir_finalize: proc session dir not empty - leaving
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: g4d003.3-net
PID:  413

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: g4d003.3-net
PID:  413

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
1 process killed (possibly by Open MPI)
[g4d003.3-net:00411] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_TERMINATED)
[g4d003.3-net:00411] sess_dir_finalize: found proc session dir empty - deleting
[g4d003.3-net:00411] sess_dir_finalize: found job session dir empty - deleting
[g4d003.3-net:00411] sess_dir_finalize: found univ session dir empty - deleting
[g4d003.3-net:00411] sess_dir_finalize: found top session dir empty - deleting
[powerbook:/Network/CFD/hello] motte%
