users-requ...@open-mpi.org wrote:
A few clarifying questions:
What is your netmask on these hosts?
Where is the MPI_ALLREDUCE in your app -- right away, or somewhere deep
within the application? Can you replicate this with a simple MPI
application that essentially calls MPI_INIT, MPI_ALLREDUCE, and
MPI_FINALIZE?
Can you replicate this with a simple MPI app that does an MPI_SEND /
MPI_RECV between two processes on the different subnets?
Thanks.
@ Jeff,
netmask 255.255.255.0
Running a simple "hello world" on the hosts of either subnet alone yields no
error, but running it across both subnets yields this error:
[g5dual.3-net:00436] *** An error occurred in MPI_Send
[g5dual.3-net:00436] *** on communicator MPI_COMM_WORLD
[g5dual.3-net:00436] *** MPI_ERR_INTERN: internal error
[g5dual.3-net:00436] *** MPI_ERRORS_ARE_FATAL (goodbye)
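To make the netmask answer concrete, here is a minimal Python sketch (not part of the test run) of what 255.255.255.0 implies for the two groups of hosts. The address 192.168.2.3 is the one that appears in the --nsreplica URI of the mpirun output; the 192.168.3.10 address is only an assumed placeholder for a 3-net host, since the real 3-net addresses are not shown anywhere in this message. The helper names network_of and same_subnet are made up for illustration.

```python
import ipaddress

NETMASK = "255.255.255.0"  # netmask reported for these hosts


def network_of(address, netmask=NETMASK):
    """Return the IPv4 network that `address` belongs to under `netmask`."""
    # strict=False allows a host address (not just a network address) as input.
    return ipaddress.ip_network(f"{address}/{netmask}", strict=False)


def same_subnet(a, b, netmask=NETMASK):
    """True if both addresses fall into the same network under `netmask`."""
    return network_of(a, netmask) == network_of(b, netmask)


if __name__ == "__main__":
    powerbook = "192.168.2.3"   # taken from the --nsreplica URI in the log
    g5dual = "192.168.3.10"     # ASSUMPTION: placeholder for a 3-net host

    print(network_of(powerbook))            # network of the 2-net host
    print(network_of(g5dual))               # network of the assumed 3-net host
    print(same_subnet(powerbook, g5dual))   # whether they share a subnet
```

Under these assumptions the two hosts land in different /24 networks, which would match the observation that the failure only shows up when ranks on both subnets take part in the same job.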
Hope this helps!
Frank
Just in case you wanna check the source:
c     Fortran example hello_world
      program hello
      include 'mpif.h'
      integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)
      character*12 message
      call MPI_INIT(ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      tag = 100
c     Rank 0 sends the greeting to every other rank; the rest receive it.
      if (rank .eq. 0) then
         message = 'Hello, world'
         do i=1, size-1
            call MPI_SEND(message, 12, MPI_CHARACTER, i, tag,
     &                    MPI_COMM_WORLD, ierror)
         enddo
      else
         call MPI_RECV(message, 12, MPI_CHARACTER, 0, tag,
     &                 MPI_COMM_WORLD, status, ierror)
      endif
      print*, 'node', rank, ':', message
      call MPI_FINALIZE(ierror)
      end
or the full output:
[powerbook:/Network/CFD/hello] motte% mpirun -d -np 5 --hostfile
./hostfile /Network/CFD/hello/hello_world
[powerbook.2-net:00606] [0,0,0] setting up session dir with
[powerbook.2-net:00606] universe default-universe
[powerbook.2-net:00606] user motte
[powerbook.2-net:00606] host powerbook.2-net
[powerbook.2-net:00606] jobid 0
[powerbook.2-net:00606] procid 0
[powerbook.2-net:00606] procdir:
/tmp/openmpi-sessions-motte@powerbook.2-net_0/default-universe/0/0
[powerbook.2-net:00606] jobdir:
/tmp/openmpi-sessions-motte@powerbook.2-net_0/default-universe/0
[powerbook.2-net:00606] unidir:
/tmp/openmpi-sessions-motte@powerbook.2-net_0/default-universe
[powerbook.2-net:00606] top: openmpi-sessions-motte@powerbook.2-net_0
[powerbook.2-net:00606] tmp: /tmp
[powerbook.2-net:00606] [0,0,0] contact_file
/tmp/openmpi-sessions-motte@powerbook.2-net_0/default-universe/universe-setup.txt
[powerbook.2-net:00606] [0,0,0] wrote setup file
[powerbook.2-net:00606] pls:rsh: local csh: 1, local bash: 0
[powerbook.2-net:00606] pls:rsh: assuming same remote shell as local shell
[powerbook.2-net:00606] pls:rsh: remote csh: 1, remote bash: 0
[powerbook.2-net:00606] pls:rsh: final template argv:
[powerbook.2-net:00606] pls:rsh: /usr/bin/ssh <template> orted
--debug --bootproxy 1 --name <template> --num_procs 6 --vpid_start 0
--nodename <template> --universe motte@powerbook.2-net:default-universe
--nsreplica "0.0.0;tcp://192.168.2.3:49443" --gprreplica
"0.0.0;tcp://192.168.2.3:49443" --mpi-call-yield 0
[powerbook.2-net:00606] pls:rsh: launching on node Powerbook.2-net
[powerbook.2-net:00606] pls:rsh: not oversubscribed -- setting
mpi_yield_when_idle to 0
[powerbook.2-net:00606] pls:rsh: Powerbook.2-net is a LOCAL node
[powerbook.2-net:00606] pls:rsh: changing to directory /Users/motte
[powerbook.2-net:00606] pls:rsh: executing: orted --debug --bootproxy 1
--name 0.0.1 --num_procs 6 --vpid_start 0 --nodename Powerbook.2-net
--universe motte@powerbook.2-net:default-universe --nsreplica
"0.0.0;tcp://192.168.2.3:49443" --gprreplica
"0.0.0;tcp://192.168.2.3:49443" --mpi-call-yield 0
[powerbook.2-net:00607] [0,0,1] setting up session dir with
[powerbook.2-net:00607] universe default-universe
[powerbook.2-net:00607] user motte
[powerbook.2-net:00607] host Powerbook.2-net
[powerbook.2-net:00607] jobid 0
[powerbook.2-net:00607] procid 1
[powerbook.2-net:00607] procdir:
/tmp/openmpi-sessions-motte@Powerbook.2-net_0/default-universe/0/1
[powerbook.2-net:00607] jobdir:
/tmp/openmpi-sessions-motte@Powerbook.2-net_0/default-universe/0
[powerbook.2-net:00607] unidir:
/tmp/openmpi-sessions-motte@Powerbook.2-net_0/default-universe
[powerbook.2-net:00607] top: openmpi-sessions-motte@Powerbook.2-net_0
[powerbook.2-net:00607] tmp: /tmp
[powerbook.2-net:00606] pls:rsh: launching on node g4d003.3-net
[powerbook.2-net:00606] pls:rsh: not oversubscribed -- setting
mpi_yield_when_idle to 0
[powerbook.2-net:00606] pls:rsh: g4d003.3-net is a REMOTE node
[powerbook.2-net:00606] pls:rsh: executing: /usr/bin/ssh g4d003.3-net
orted --debug --bootproxy 1 --name 0.0.2 --num_procs 6 --vpid_start 0
--nodename g4d003.3-net --universe
motte@powerbook.2-net:default-universe --nsreplica
"0.0.0;tcp://192.168.2.3:49443" --gprreplica
"0.0.0;tcp://192.168.2.3:49443" --mpi-call-yield 0
[g4d003.3-net:00411] [0,0,2] setting up session dir with
[g4d003.3-net:00411] universe default-universe
[g4d003.3-net:00411] user motte
[g4d003.3-net:00411] host g4d003.3-net
[g4d003.3-net:00411] jobid 0
[g4d003.3-net:00411] procid 2
[g4d003.3-net:00411] procdir:
/tmp/openmpi-sessions-motte@g4d003.3-net_0/default-universe/0/2
[g4d003.3-net:00411] jobdir:
/tmp/openmpi-sessions-motte@g4d003.3-net_0/default-universe/0
[g4d003.3-net:00411] unidir:
/tmp/openmpi-sessions-motte@g4d003.3-net_0/default-universe
[g4d003.3-net:00411] top: openmpi-sessions-motte@g4d003.3-net_0
[g4d003.3-net:00411] tmp: /tmp
[powerbook.2-net:00606] pls:rsh: launching on node g4d002.3-net
[powerbook.2-net:00606] pls:rsh: not oversubscribed -- setting
mpi_yield_when_idle to 0
[powerbook.2-net:00606] pls:rsh: g4d002.3-net is a REMOTE node
[powerbook.2-net:00606] pls:rsh: executing: /usr/bin/ssh g4d002.3-net
orted --debug --bootproxy 1 --name 0.0.3 --num_procs 6 --vpid_start 0
--nodename g4d002.3-net --universe
motte@powerbook.2-net:default-universe --nsreplica
"0.0.0;tcp://192.168.2.3:49443" --gprreplica
"0.0.0;tcp://192.168.2.3:49443" --mpi-call-yield 0
[powerbook.2-net:00606] pls:rsh: launching on node g4d001.3-net
[powerbook.2-net:00606] pls:rsh: not oversubscribed -- setting
mpi_yield_when_idle to 0
[powerbook.2-net:00606] pls:rsh: g4d001.3-net is a REMOTE node
[powerbook.2-net:00606] pls:rsh: executing: /usr/bin/ssh g4d001.3-net
orted --debug --bootproxy 1 --name 0.0.4 --num_procs 6 --vpid_start 0
--nodename g4d001.3-net --universe
motte@powerbook.2-net:default-universe --nsreplica
"0.0.0;tcp://192.168.2.3:49443" --gprreplica
"0.0.0;tcp://192.168.2.3:49443" --mpi-call-yield 0
[powerbook.2-net:00606] pls:rsh: launching on node G5Dual.3-net
[powerbook.2-net:00606] pls:rsh: not oversubscribed -- setting
mpi_yield_when_idle to 0
[powerbook.2-net:00606] pls:rsh: G5Dual.3-net is a REMOTE node
[powerbook.2-net:00606] pls:rsh: executing: /usr/bin/ssh G5Dual.3-net
orted --debug --bootproxy 1 --name 0.0.5 --num_procs 6 --vpid_start 0
--nodename G5Dual.3-net --universe
motte@powerbook.2-net:default-universe --nsreplica
"0.0.0;tcp://192.168.2.3:49443" --gprreplica
"0.0.0;tcp://192.168.2.3:49443" --mpi-call-yield 0
[g4d001.3-net:00336] [0,0,4] setting up session dir with
[g4d001.3-net:00336] universe default-universe
[g4d001.3-net:00336] user motte
[g4d001.3-net:00336] host g4d001.3-net
[g4d001.3-net:00336] jobid 0
[g4d001.3-net:00336] procid 4
[g4d001.3-net:00336] procdir:
/tmp/openmpi-sessions-motte@g4d001.3-net_0/default-universe/0/4
[g4d001.3-net:00336] jobdir:
/tmp/openmpi-sessions-motte@g4d001.3-net_0/default-universe/0
[g4d001.3-net:00336] unidir:
/tmp/openmpi-sessions-motte@g4d001.3-net_0/default-universe
[g4d001.3-net:00336] top: openmpi-sessions-motte@g4d001.3-net_0
[g4d001.3-net:00336] tmp: /tmp
[g4d002.3-net:00279] [0,0,3] setting up session dir with
[g4d002.3-net:00279] universe default-universe
[g4d002.3-net:00279] user motte
[g4d002.3-net:00279] host g4d002.3-net
[g4d002.3-net:00279] jobid 0
[g4d002.3-net:00279] procid 3
[g4d002.3-net:00279] procdir:
/tmp/openmpi-sessions-motte@g4d002.3-net_0/default-universe/0/3
[g4d002.3-net:00279] jobdir:
/tmp/openmpi-sessions-motte@g4d002.3-net_0/default-universe/0
[g4d002.3-net:00279] unidir:
/tmp/openmpi-sessions-motte@g4d002.3-net_0/default-universe
[g4d002.3-net:00279] top: openmpi-sessions-motte@g4d002.3-net_0
[g4d002.3-net:00279] tmp: /tmp
[g5dual.3-net:00434] [0,0,5] setting up session dir with
[g5dual.3-net:00434] universe default-universe
[g5dual.3-net:00434] user motte
[g5dual.3-net:00434] host G5Dual.3-net
[g5dual.3-net:00434] jobid 0
[g5dual.3-net:00434] procid 5
[g5dual.3-net:00434] procdir:
/tmp/openmpi-sessions-motte@G5Dual.3-net_0/default-universe/0/5
[g5dual.3-net:00434] jobdir:
/tmp/openmpi-sessions-motte@G5Dual.3-net_0/default-universe/0
[g5dual.3-net:00434] unidir:
/tmp/openmpi-sessions-motte@G5Dual.3-net_0/default-universe
[g5dual.3-net:00434] top: openmpi-sessions-motte@G5Dual.3-net_0
[g5dual.3-net:00434] tmp: /tmp
[powerbook.2-net:00613] [0,1,4] setting up session dir with
[powerbook.2-net:00613] universe default-universe
[powerbook.2-net:00613] user motte
[powerbook.2-net:00613] host Powerbook.2-net
[powerbook.2-net:00613] jobid 1
[powerbook.2-net:00613] procid 4
[powerbook.2-net:00613] procdir:
/tmp/openmpi-sessions-motte@Powerbook.2-net_0/default-universe/1/4
[powerbook.2-net:00613] jobdir:
/tmp/openmpi-sessions-motte@Powerbook.2-net_0/default-universe/1
[powerbook.2-net:00613] unidir:
/tmp/openmpi-sessions-motte@Powerbook.2-net_0/default-universe
[powerbook.2-net:00613] top: openmpi-sessions-motte@Powerbook.2-net_0
[powerbook.2-net:00613] tmp: /tmp
[g5dual.3-net:00436] [0,1,0] setting up session dir with
[g5dual.3-net:00436] universe default-universe
[g5dual.3-net:00436] user motte
[g5dual.3-net:00436] host G5Dual.3-net
[g5dual.3-net:00436] jobid 1
[g5dual.3-net:00436] procid 0
[g5dual.3-net:00436] procdir:
/tmp/openmpi-sessions-motte@G5Dual.3-net_0/default-universe/1/0
[g5dual.3-net:00436] jobdir:
/tmp/openmpi-sessions-motte@G5Dual.3-net_0/default-universe/1
[g5dual.3-net:00436] unidir:
/tmp/openmpi-sessions-motte@G5Dual.3-net_0/default-universe
[g5dual.3-net:00436] top: openmpi-sessions-motte@G5Dual.3-net_0
[g5dual.3-net:00436] tmp: /tmp
[g4d001.3-net:00338] [0,1,1] setting up session dir with
[g4d001.3-net:00338] universe default-universe
[g4d001.3-net:00338] user motte
[g4d001.3-net:00338] host g4d001.3-net
[g4d001.3-net:00338] jobid 1
[g4d001.3-net:00338] procid 1
[g4d001.3-net:00338] procdir:
/tmp/openmpi-sessions-motte@g4d001.3-net_0/default-universe/1/1
[g4d001.3-net:00338] jobdir:
/tmp/openmpi-sessions-motte@g4d001.3-net_0/default-universe/1
[g4d001.3-net:00338] unidir:
/tmp/openmpi-sessions-motte@g4d001.3-net_0/default-universe
[g4d001.3-net:00338] top: openmpi-sessions-motte@g4d001.3-net_0
[g4d001.3-net:00338] tmp: /tmp
[g4d003.3-net:00413] [0,1,3] setting up session dir with
[g4d003.3-net:00413] universe default-universe
[g4d003.3-net:00413] user motte
[g4d003.3-net:00413] host g4d003.3-net
[g4d003.3-net:00413] jobid 1
[g4d003.3-net:00413] procid 3
[g4d003.3-net:00413] procdir:
/tmp/openmpi-sessions-motte@g4d003.3-net_0/default-universe/1/3
[g4d003.3-net:00413] jobdir:
/tmp/openmpi-sessions-motte@g4d003.3-net_0/default-universe/1
[g4d003.3-net:00413] unidir:
/tmp/openmpi-sessions-motte@g4d003.3-net_0/default-universe
[g4d003.3-net:00413] top: openmpi-sessions-motte@g4d003.3-net_0
[g4d003.3-net:00413] tmp: /tmp
[g4d002.3-net:00281] [0,1,2] setting up session dir with
[g4d002.3-net:00281] universe default-universe
[g4d002.3-net:00281] user motte
[g4d002.3-net:00281] host g4d002.3-net
[g4d002.3-net:00281] jobid 1
[g4d002.3-net:00281] procid 2
[g4d002.3-net:00281] procdir:
/tmp/openmpi-sessions-motte@g4d002.3-net_0/default-universe/1/2
[g4d002.3-net:00281] jobdir:
/tmp/openmpi-sessions-motte@g4d002.3-net_0/default-universe/1
[g4d002.3-net:00281] unidir:
/tmp/openmpi-sessions-motte@g4d002.3-net_0/default-universe
[g4d002.3-net:00281] top: openmpi-sessions-motte@g4d002.3-net_0
[g4d002.3-net:00281] tmp: /tmp
[powerbook.2-net:00606] spawn: in job_state_callback(jobid = 1, state = 0x4)
[powerbook.2-net:00606] Info: Setting up debugger process table for
applications
MPIR_being_debugged = 0
MPIR_debug_gate = 0
MPIR_debug_state = 1
MPIR_acquired_pre_main = 0
MPIR_i_am_starter = 0
MPIR_proctable_size = 5
MPIR_proctable:
(i, host, exe, pid) = (0, G5Dual.3-net,
/Network/CFD/hello/hello_world, 436)
(i, host, exe, pid) = (1, g4d001.3-net,
/Network/CFD/hello/hello_world, 338)
(i, host, exe, pid) = (2, g4d002.3-net,
/Network/CFD/hello/hello_world, 281)
(i, host, exe, pid) = (3, g4d003.3-net,
/Network/CFD/hello/hello_world, 413)
(i, host, exe, pid) = (4, Powerbook.2-net,
/Network/CFD/hello/hello_world, 613)
[powerbook.2-net:00613] [0,1,4] ompi_mpi_init completed
[g4d001.3-net:00338] [0,1,1] ompi_mpi_init completed
[g5dual.3-net:00436] [0,1,0] ompi_mpi_init completed
[g4d003.3-net:00413] [0,1,3] ompi_mpi_init completed
[g4d002.3-net:00281] [0,1,2] ompi_mpi_init completed
node 1 :Hello, world
node 2 :Hello, world node 3 :Hello, world
[g5dual.3-net:00436] *** An error occurred in MPI_Send
[g5dual.3-net:00436] *** on communicator MPI_COMM_WORLD
[g5dual.3-net:00436] *** MPI_ERR_INTERN: internal error
[g5dual.3-net:00436] *** MPI_ERRORS_ARE_FATAL (goodbye)
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: powerbook.2-net
PID: 613
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: g4d003.3-net
PID: 413
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: g5dual.3-net
PID: 436
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: g4d002.3-net
PID: 281
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: g4d001.3-net
PID: 338
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
[g5dual.3-net:00434] sess_dir_finalize: found proc session dir empty -
deleting
[g5dual.3-net:00434] sess_dir_finalize: found job session dir empty -
deleting
[g5dual.3-net:00434] sess_dir_finalize: univ session dir not empty - leaving
[powerbook.2-net:00607] orted: job_state_callback(jobid = 1, state =
ORTE_PROC_STATE_ABORTED)
[g5dual.3-net:00434] orted: job_state_callback(jobid = 1, state =
ORTE_PROC_STATE_ABORTED)
[g4d003.3-net:00411] orted: job_state_callback(jobid = 1, state =
ORTE_PROC_STATE_ABORTED)
[g4d001.3-net:00336] orted: job_state_callback(jobid = 1, state =
ORTE_PROC_STATE_ABORTED)
[g5dual.3-net:00434] sess_dir_finalize: job session dir not empty - leaving
[g5dual.3-net:00434] sess_dir_finalize: found proc session dir empty -
deleting
[g5dual.3-net:00434] sess_dir_finalize: found job session dir empty -
deleting
[g5dual.3-net:00434] sess_dir_finalize: found univ session dir empty -
deleting
[g5dual.3-net:00434] sess_dir_finalize: found top session dir empty -
deleting
[g4d002.3-net:00279] orted: job_state_callback(jobid = 1, state =
ORTE_PROC_STATE_ABORTED)
[g4d002.3-net:00279] sess_dir_finalize: found job session dir empty -
deleting
[g4d002.3-net:00279] sess_dir_finalize: univ session dir not empty - leaving
[g4d002.3-net:00279] sess_dir_finalize: proc session dir not empty - leaving
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: g4d002.3-net
PID: 281
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: g4d002.3-net
PID: 281
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
[g4d002.3-net:00279] sess_dir_finalize: found proc session dir empty -
deleting
[g4d002.3-net:00279] sess_dir_finalize: found job session dir empty -
deleting
[g4d002.3-net:00279] sess_dir_finalize: found univ session dir empty -
deleting
[g4d002.3-net:00279] sess_dir_finalize: found top session dir empty -
deleting
[powerbook.2-net:00607] sess_dir_finalize: found job session dir empty -
deleting
[powerbook.2-net:00607] sess_dir_finalize: univ session dir not empty -
leaving
[powerbook.2-net:00607] sess_dir_finalize: proc session dir not empty -
leaving
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: powerbook.2-net
PID: 613
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: powerbook.2-net
PID: 613
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
[powerbook.2-net:00607] sess_dir_finalize: found proc session dir empty
- deleting
[powerbook.2-net:00607] sess_dir_finalize: job session dir not empty -
leaving
[g4d001.3-net:00336] sess_dir_finalize: found job session dir empty -
deleting
[g4d001.3-net:00336] sess_dir_finalize: univ session dir not empty - leaving
[g4d001.3-net:00336] sess_dir_finalize: proc session dir not empty - leaving
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: g4d001.3-net
PID: 338
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: g4d001.3-net
PID: 338
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
[g4d001.3-net:00336] sess_dir_finalize: found proc session dir empty -
deleting
[g4d001.3-net:00336] sess_dir_finalize: found job session dir empty -
deleting
[g4d001.3-net:00336] sess_dir_finalize: found univ session dir empty -
deleting
[g4d001.3-net:00336] sess_dir_finalize: found top session dir empty -
deleting
[g4d003.3-net:00411] sess_dir_finalize: found job session dir empty -
deleting
[g4d003.3-net:00411] sess_dir_finalize: univ session dir not empty - leaving
[g4d003.3-net:00411] sess_dir_finalize: proc session dir not empty - leaving
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: g4d003.3-net
PID: 413
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: g4d003.3-net
PID: 413
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
1 process killed (possibly by Open MPI)
[g4d003.3-net:00411] orted: job_state_callback(jobid = 1, state =
ORTE_PROC_STATE_TERMINATED)
[g4d003.3-net:00411] sess_dir_finalize: found proc session dir empty -
deleting
[g4d003.3-net:00411] sess_dir_finalize: found job session dir empty -
deleting
[g4d003.3-net:00411] sess_dir_finalize: found univ session dir empty -
deleting
[g4d003.3-net:00411] sess_dir_finalize: found top session dir empty -
deleting
[powerbook:/Network/CFD/hello] motte%