I am experiencing many different errors with Open MPI version 2.1.1. I
have a suspicion that this might be related to the way the servers are
connected and configured. Regardless, below is a diagram of how the
servers are configured.
__ _
[__]|=|
/::/|_|
HOST: smd
Dual 1Gb Ethernet Bonded
.-------------> Bond0 IP: 192.168.1.200
| Infiniband Card: MHQH29B-XTR <------------.
| Ib0 IP: 10.1.0.1 |
| OS: Ubuntu Mate |
| __ _ |
| [__]|=| |
| /::/|_| |
| HOST: sm1 |
| Dual 1Gb Ethernet Bonded |
|-------------> Bond0 IP: 192.168.1.196 |
| Infiniband Card: QLOGIC QLE7340 <---------|
| Ib0 IP: 10.1.0.2 |
| OS: Centos 7 Minimal |
| __ _ |
| [__]|=| |
|---------. /::/|_| |
| | HOST: sm2 |
| | Dual 1Gb Ethernet Bonded |
| '---> Bond0 IP: 192.168.1.199 |
__________ Infiniband Card: QLOGIC QLE7340 __________
[_|||||||_°] Ib0 IP: 10.1.0.3 [_|||||||_°]
[_|||||||_°] OS: Centos 7 Minimal [_|||||||_°]
[_|||||||_°] __ _ [_|||||||_°]
Gb Ethernet Switch [__]|=| Voltaire 4036 QDR Switch
| /::/|_| |
| HOST: sm3 |
| Dual 1Gb Ethernet Bonded |
|-------------> Bond0 IP: 192.168.1.203 |
| Infiniband Card: QLOGIC QLE7340 <----------|
| Ib0 IP: 10.1.0.4 |
| OS: Centos 7 Minimal |
| __ _ |
| [__]|=| |
| /::/|_| |
| HOST: sm4 |
| Dual 1Gb Ethernet Bonded |
|-------------> Bond0 IP: 192.168.1.204 |
| Infiniband Card: QLOGIC QLE7340 <----------|
| Ib0 IP: 10.1.0.5 |
| OS: Centos 7 Minimal |
| __ _ |
| [__]|=| |
| /::/|_| |
| HOST: dl580 |
| Dual 1Gb Ethernet Bonded |
'-------------> Bond0 IP: 192.168.1.201 |
Infiniband Card: QLOGIC QLE7340 <----------'
Ib0 IP: 10.1.0.6
OS: Centos 7 Minimal
I have ensured that the InfiniBand adapters can ping each other and
that every node can ssh into every other node without a password. Every
node has the same /etc/hosts file:
cat /etc/hosts
127.0.0.1 localhost
192.168.1.200 smd
192.168.1.196 sm1
192.168.1.199 sm2
192.168.1.203 sm3
192.168.1.204 sm4
192.168.1.201 dl580
10.1.0.1 smd-ib
10.1.0.2 sm1-ib
10.1.0.3 sm2-ib
10.1.0.4 sm3-ib
10.1.0.5 sm4-ib
10.1.0.6 dl580-ib
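For reference, the ping and passwordless-ssh checks were done along
these lines (a sketch, not the exact commands; the hostnames are the
ones from the /etc/hosts file above):

```shell
# Ping every node once over the IPoIB interface (the *-ib names):
for h in smd-ib sm1-ib sm2-ib sm3-ib sm4-ib dl580-ib; do
    ping -c 1 -W 2 "$h" > /dev/null && echo "$h: ping ok" || echo "$h: ping FAILED"
done

# Confirm key-based ssh to every node; BatchMode makes ssh fail
# instead of prompting if passwordless login is not actually set up:
for h in smd sm1 sm2 sm3 sm4 dl580; do
    ssh -o BatchMode=yes "$h" true && echo "$h: ssh ok" || echo "$h: ssh FAILED"
done
```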
I have been using a simple ring test program to test Open MPI. The code
for this program is attached.
The hostfile used in all the commands is:
cat ./nodes
smd slots=2
sm1 slots=2
sm2 slots=2
sm3 slots=2
sm4 slots=2
dl580 slots=2
When running the following command on smd,
mpirun -mca btl openib,self -np 2 --hostfile nodes ./ring
I obtain the following error,
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
Local host: sm1
Remote host: 192.168.1.200
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: smd
Local device: mlx4_0
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
Process 1 received token -1 from process 0
Process 0 received token -1 from process 1
[smd:12800] 1 more process has sent help message
help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[smd:12800] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
help / error messages
When increasing the number of processes, no program output is produced.
mpirun -mca btl openib,self -np 4 --hostfile nodes ./ring
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
Local host: sm2
Remote host: 192.168.1.200
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.
Host: sm1.overst.local
Framework: btl
Component: openib
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
mca_bml_base_open() failed
--> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: smd
Local device: mlx4_0
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
[smd:12953] 1 more process has sent help message help-mca-base.txt /
find-available:not-valid
[smd:12953] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
help / error messages
[smd:12953] 1 more process has sent help message help-mpi-runtime.txt /
mpi_init:startup:internal-failure
[smd:12953] 1 more process has sent help message
help-mpi-btl-openib-cpc-base.txt / no cpcs for port
Running mpirun from other nodes does not resolve the issue. I have
checked that none of the nodes is running a firewall that would block
TCP connections.
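The firewall state was checked with commands along these lines (a
sketch; firewalld is the default front end on CentOS 7, and the
/dev/tcp probe is a bash-only feature using sshd's port 22 as a
known-open service):

```shell
# On each CentOS 7 node: confirm the firewall is off and the rules are empty.
systemctl is-active firewalld   # expect "inactive"
iptables -L -n                  # expect empty ACCEPT chains

# Quick TCP reachability probe from one node to another (bash /dev/tcp):
if (echo > /dev/tcp/192.168.1.200/22) 2>/dev/null; then
    echo "tcp to smd:22 ok"
else
    echo "tcp to smd:22 blocked"
fi
```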
The error about the mlx4_0 adapter is expected, as that device is used
as a 10Gb Ethernet adapter to another network. The InfiniBand adapter
on smd that is being used for QDR InfiniBand is mlx4_1.
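For what it's worth, Open MPI's `btl_openib_if_include` MCA parameter
can name the device (and optionally the port) that the openib BTL
should use, which would keep it off the Ethernet-mode mlx4_0 port. A
sketch of the invocation, assuming mlx4_1 port 1 is the QDR HCA on smd
as described above:

```shell
# Restrict the openib BTL to the QDR HCA so mlx4_0 is never probed:
mpirun --mca btl openib,self \
       --mca btl_openib_if_include mlx4_1:1 \
       -np 2 --hostfile nodes ./ring
```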
Any help would be appreciated.
Sincerely,
Allan Overstreet
// Author: Wes Kendall
// Copyright 2011 www.mpitutorial.com
// This code is provided freely with the tutorials on mpitutorial.com. Feel
// free to modify it for your own use. Any distribution of the code must
// either provide a link to www.mpitutorial.com or keep this header intact.
//
// Example using MPI_Send and MPI_Recv to pass a message around in a ring.
//
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char** argv) {
  // Initialize the MPI environment
  MPI_Init(NULL, NULL);
  // Find out rank, size
  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  int world_size;
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  int token;
  // Receive from the lower process and send to the higher process. Take care
  // of the special case when you are the first process to prevent deadlock.
  if (world_rank != 0) {
    MPI_Recv(&token, 1, MPI_INT, world_rank - 1, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n", world_rank, token,
           world_rank - 1);
  } else {
    // Set the token's value if you are process 0
    token = -1;
  }
  MPI_Send(&token, 1, MPI_INT, (world_rank + 1) % world_size, 0,
           MPI_COMM_WORLD);
  // Now process 0 can receive from the last process. This makes sure that at
  // least one MPI_Send is initialized before all MPI_Recvs (again, to prevent
  // deadlock)
  if (world_rank == 0) {
    MPI_Recv(&token, 1, MPI_INT, world_size - 1, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n", world_rank, token,
           world_size - 1);
  }
  MPI_Finalize();
}
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users