Hi Eduardo,
The OFI MTL gained some new features during 2018 that went into v4.0.0 but were
not backported to older OMPI versions.
What version of libfabric are you using, and where did you install it from?
I will try to reproduce your error. I ran some quick tests and it works for
me:
/tmp >ompi_info
Package: Open MPI [email protected] Distribution
Open MPI: 4.0.0rc5
Open MPI repo revision: v4.0.0
Open MPI release date: Unreleased developer copy
Open RTE: 4.0.0rc5
Open RTE repo revision: v4.0.0
Open RTE release date: Unreleased developer copy
OPAL: 4.0.0rc5
OPAL repo revision: v4.0.0
OPAL release date: Unreleased developer copy
MPI API: 3.1.0
Ident string: 4.0.0rc5
Prefix: /nfs/sc/disks/fabric_work/macabral/tmp/ompi-4.0.0
Configured architecture: x86_64-unknown-linux-gnu
Configure host: sperf-41.sc.intel.com
Configured by: macabral
Configured on: Fri Jan 11 17:42:06 EST 2019
Configure host: sperf-41.sc.intel.com
Configure command line: '--with-ofi' '--with-verbs=no'
'--prefix=/tmp/ompi-4.0.0'
....
/tmp> rpm -qi libfabric
Name : libfabric
Version : 1.6.0
Release : 80
Architecture: x86_64
Install Date: Wed 19 Dec 2018 05:45:41 PM EST
Group : System Environment/Libraries
Size : 10131964
License : GPLv2 or BSD
Signature : (none)
Source RPM : libfabric-1.6.0-80.src.rpm
Build Date : Wed 22 Aug 2018 11:08:29 PM EDT
Build Host : ph-bld-node-27.ph.intel.com
Relocations : (not relocatable)
URL : http://www.github.com/ofiwg/libfabric
Summary : User-space RDMA Fabric Interfaces
Description :
libfabric provides a user-space API to access high-performance fabric
services, such as RDMA.
/tmp> mpirun -np 2 -mca mtl ofi -mca pml cm ./a
Hello World from proccess 0 out of 2
This is process 0 reporting::
Hello World from proccess 1 out of 2
Process 1 received number 10 from process 0
From: users [mailto:[email protected]] On Behalf Of ROTHE Eduardo - externe
Sent: Thursday, January 10, 2019 10:02 AM
To: Open MPI Users <[email protected]>
Subject: Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send
Hi Gilles, thank you so much once again!
I succeeded using the psm2 mtl directly. In fact, I do not even need to request
the cm pml explicitly (I guess the cm pml gets selected automatically when I
force the psm2 mtl?). Both of the following commands execute successfully with
Open MPI 4.0.0:
> mpirun --mca pml cm --mca mtl psm2 -np 2 ./a.out
> mpirun --mca mtl psm2 -np 2 ./a.out
The error persists when using libfabric. The following command returns the
MPI_Send error:
> mpirun --mca pml cm --mca mtl ofi -np 2 ./a.out
It seems the problem sits between libfabric and Open MPI 4.0.0 (remember, I
don't see the same behaviour with Open MPI 3.1.3). So if I want to use
libfabric, I will have to dig a bit more into the interface between this
library and Open MPI 4.0.0. Any pointers on where to start would be
(very!) appreciated.
If you have any diagram that would help me understand the framework/module
architecture and why some modules are selected automatically (as in the case
above), I would be very pleased (even more! :).
Regards,
Eduardo
________________________________
From: users <[email protected]> on behalf of [email protected]
Sent: Thursday, January 10, 2019 13:51
To: Open MPI Users
Subject: Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send
Eduardo,
You have two options for using OmniPath:
- "directly" via the psm2 mtl
mpirun -mca pml cm -mca mtl psm2 ...
- "indirectly" via libfabric
mpirun -mca pml cm -mca mtl ofi ...
I do invite you to try both. By explicitly requesting the mtl you will avoid
potential conflicts.
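
As a rough illustration of why naming the component matters, here is a toy
model in C of priority-based selection within one framework. It is not Open
MPI's code: the struct, the function names and the priority numbers are all
made up for this sketch. The only behaviour it assumes is that a
comma-separated list passed via `--mca <framework> <list>` restricts the
candidates, and that the usable candidate with the highest priority wins.
Exclusion lists (a leading `^`) are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of one component inside a framework. */
struct component {
    const char *name; /* e.g. "psm2" or "ofi" */
    int usable;       /* nonzero if the component initialised on this node */
    int priority;     /* made-up number for this sketch; higher wins */
};

/* Return 1 if `name` is one of the entries of the comma-separated `list`. */
int in_list(const char *list, const char *name)
{
    const size_t len = strlen(name);
    const char *p = list;

    while (*p != '\0') {
        const char *end = strchr(p, ',');
        const size_t tok_len = end ? (size_t)(end - p) : strlen(p);

        if (tok_len == len && strncmp(p, name, len) == 0) {
            return 1;
        }
        if (end == NULL) {
            break;
        }
        p = end + 1;
    }
    return 0;
}

/*
 * Pick the usable component with the highest priority.
 * `requested` mimics the list given to "--mca <framework> <list>";
 * NULL means the user did not restrict the framework.
 * Returns NULL when no component qualifies.
 */
const struct component *select_component(const struct component *c, size_t n,
                                         const char *requested)
{
    assert(c != NULL || n == 0);
    const struct component *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (!c[i].usable) {
            continue; /* failed to initialise: never a candidate */
        }
        if (requested != NULL && !in_list(requested, c[i].name)) {
            continue; /* filtered out by the user's list */
        }
        if (best == NULL || c[i].priority > best->priority) {
            best = &c[i];
        }
    }
    return best;
}
```

With a hypothetical table holding a usable "psm2" entry at priority 30 and a
usable "ofi" entry at priority 20, `select_component(table, 2, NULL)` returns
the psm2 entry, while `select_component(table, 2, "ofi")` returns the ofi entry
regardless of priority.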
libfabric is used in production by Cisco and AWS (both major contributors to
both Open MPI and libfabric), so this is clearly not something to stay away
from. That being said, bugs always happen, and they could be related to Open
MPI, libfabric and/or OmniPath (and FWIW, Intel is a major contributor to
libfabric too).
Cheers,
Gilles
On Thursday, January 10, 2019, ROTHE Eduardo - externe
<[email protected]> wrote:
Hi Gilles, thank you so much for your support!
For now I'm just testing the software, so it's running on a single node.
Your suggestion was spot on. In fact, choosing the ob1 component leads to
a successful execution! Adding the tcp btl made no difference.
mpirun --mca pml ob1 -mca btl tcp,self -np 2 ./a.out -> Success
mpirun --mca pml ob1 -np 2 ./a.out -> Success
But... our cluster is equipped with Intel Omni-Path interconnects and we are
aiming to use psm2 through the ofi component in order to take full advantage of
this technology.
I believe your suggestion shows that the problem lies right here, but
unfortunately I cannot see any further.
Meanwhile, I've also compiled Open MPI 3.1.3 and I get a successful run with
the same options and the same environment (no MPI_Send error). Could Open MPI
4.0.0 behave differently in this area, possibly in the ofi component?
Do you have any idea that I could try in order to narrow the problem down
further?
Regards,
Eduardo
PS: I've recompiled Open MPI 4.0.0 using --with-hwloc=external, but the
result is the same (the same MPI_Send error).
PS2: Yes, the configure line is really fishy. The original line was
--prefix=/opt/openmpi/4.0.0 --with-pmix=/usr/lib/x86_64-linux-gnu/pmix
--with-libevent=external --with-slurm --enable-mpi-cxx --with-ofi
--with-verbs=no --disable-silent-rules --with-hwloc=/usr
--enable-mpirun-prefix-by-default --with-devel-headers
________________________________
From: users <[email protected]> on behalf of [email protected]
Sent: Wednesday, January 9, 2019 15:16
To: Open MPI Users
Subject: Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send
Eduardo,
The first part of the configure command line is for an install in /usr, but
then there is '--prefix=/opt/openmpi/4.0.0', and this is very fishy.
You should also use '--with-hwloc=external'.
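
To see which value of a repeated option takes effect, a small C helper along
these lines can scan the options reported by ompi_info. It is only a sketch:
the function name `effective_option` is made up here, and it assumes that a
later option overrides an earlier one, which is how autoconf-generated
configure scripts usually treat a repeated `--prefix`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the value of the LAST "name=value" entry in `args` whose name is
 * exactly `name` (for example "--prefix"), or NULL if it never appears.
 * If `count` is not NULL, it receives the number of occurrences, so a
 * result greater than 1 flags an option that was given more than once.
 */
const char *effective_option(const char *const *args, size_t nargs,
                             const char *name, size_t *count)
{
    assert(args != NULL && name != NULL);
    const size_t name_len = strlen(name);
    const char *value = NULL;
    size_t seen = 0;

    for (size_t i = 0; i < nargs; i++) {
        /* Require "name=" so that "--prefix" does not match "--prefixes=". */
        if (strncmp(args[i], name, name_len) == 0 &&
            args[i][name_len] == '=') {
            value = args[i] + name_len + 1; /* later entries override */
            seen++;
        }
    }
    if (count != NULL) {
        *count = seen;
    }
    return value;
}
```

Applied to the configure line quoted in this thread, it would report two
occurrences of `--prefix`, with `/opt/openmpi/4.0.0` as the effective value.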
How many nodes are you running on, and which interconnect are you using?
What happens if you run
mpirun -mca pml ob1 -mca btl tcp,self -np 2 ./a.out
Cheers,
Gilles
On Wednesday, January 9, 2019, ROTHE Eduardo - externe
<[email protected]> wrote:
Hi.
I'm testing Open MPI 4.0.0 and I'm struggling with some weird behaviour in a
very simple example (very frustrating). MPI_Send returns the following error:
[gafront4:25692] *** An error occurred in MPI_Send
[gafront4:25692] *** reported by process [3152019457,0]
[gafront4:25692] *** on communicator MPI_COMM_WORLD
[gafront4:25692] *** MPI_ERR_OTHER: known error not in list
[gafront4:25692] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[gafront4:25692] *** and potentially your MPI job)
On the same machine I have two other installations of Open MPI (2.0.2 and
2.1.2), and both run this dummy program successfully:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int process;     /* rank of this process */
    int population;  /* number of processes in MPI_COMM_WORLD */

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &process);
    MPI_Comm_size(MPI_COMM_WORLD, &population);
    printf("Hello World from proccess %d out of %d\n", process, population);

    int send_number = 10;
    int recv_number;

    /* Rank 0 sends one int to rank 1, so at least 2 processes are needed. */
    if (process == 0) {
        MPI_Send(&send_number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("This is process 0 reporting::\n");
    } else if (process == 1) {
        MPI_Recv(&recv_number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("Process 1 received number %d from process 0\n",
               recv_number);
    }

    MPI_Finalize();
    return 0;
}
I'm really sorry to bother you with this problem. I've been at it for days now
and can't find a good solution. Can you please take a look? I've set
FI_LOG_LEVEL=Debug to see if I could capture any useful information, but
unfortunately without success. I've also googled a lot, but I can't see what
this error message might be pointing at, especially with two other working
versions on the same machine. The thing is that I see no reason why this code
shouldn't run.
The following is the configure command line, as given by ompi_info.
Configure command line: '--build=x86_64-linux-gnu' '--prefix=/usr'
'--includedir=${prefix}/include'
'--mandir=${prefix}/share/man'
'--infodir=${prefix}/share/info'
'--sysconfdir=/etc' '--localstatedir=/var'
'--disable-silent-rules'
'--libdir=${prefix}/lib/x86_64-linux-gnu'
'--libexecdir=${prefix}/lib/x86_64-linux-gnu'
'--disable-maintainer-mode'
'--disable-dependency-tracking'
'--prefix=/opt/openmpi/4.0.0'
'--with-pmix=/usr/lib/x86_64-linux-gnu/pmix'
'--with-libevent=external' '--with-slurm'
'--enable-mpi-cxx' '--with-ofi' '--with-verbs=no'
'--disable-silent-rules' '--with-hwloc=/usr'
'--enable-mpirun-prefix-by-default'
'--with-devel-headers'
Thank you for your time.
Regards,
Ed
This message and any attachments (the 'Message') are intended solely for the
addressees. The information contained in this Message is confidential. Any use
of information contained in this Message not in accord with its purpose, any
dissemination or disclosure, either whole or partial, is prohibited except
formal approval.
If you are not the addressee, you may not copy, forward, disclose or use any
part of it. If you have received this message in error, please delete it and
all copies from your system and notify the sender immediately by return message.
E-mail communication cannot be guaranteed to be timely, secure, error-free or
virus-free.
_______________________________________________
users mailing list
[email protected]
https://lists.open-mpi.org/mailman/listinfo/users