Re: [OMPI users] Can't connect using MPI Ports

2017-11-09 Thread Florian Lindner
>> The MPI Ports functionality (chapter 10.4 of MPI 3.1), mainly consisting of
>> MPI_Open_port, MPI_Comm_accept and MPI_Comm_connect, is not usable without
>> running an ompi-server as a third process?
> 
> Yes, that’s correct. The reason for moving in that direction is that the 
> resource managers, as they continue to
> integrate PMIx into them, are going to be providing that third party. This 
> will make connect/accept much easier to use,
> and a great deal more scalable.
> 
> See https://github.com/pmix/RFCs/blob/master/RFC0003.md for an explanation.


OK, thanks for that input. I hadn't heard of PMIx before, except as part of some
OMPI error messages.

Using ompi-server -d -r 'ompi.connect' I was able to publish and retrieve the
port name; however, no connection could be established.

% mpirun -n 1 --ompi-server "file:ompi.connect" ./a.out A
Published port 3044605953.0:664448538

% mpirun -n 1 --ompi-server "file:ompi.connect" ./a.out B
Looked up port 3044605953.0:664448538


At this point, both processes hang.

The code is:

#include <mpi.h>
#include <string>
#include <cstdio>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  std::string a(argv[1]);
  char p[MPI_MAX_PORT_NAME];
  MPI_Comm icomm;

  if (a == "A") {
    // Open a port, publish it under a name, and wait for a connection.
    MPI_Open_port(MPI_INFO_NULL, p);
    MPI_Publish_name("foobar", MPI_INFO_NULL, p);
    printf("Published port %s\n", p);
    MPI_Comm_accept(p, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &icomm);
  }
  if (a == "B") {
    // Look up the published name and connect to the resulting port.
    MPI_Lookup_name("foobar", MPI_INFO_NULL, p);
    printf("Looked up port %s\n", p);
    MPI_Comm_connect(p, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &icomm);
  }

  MPI_Finalize();

  return 0;
}
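
For reference, here is a minimal sketch (not part of the reproducer above; the
helper name exchange_and_teardown and the is_server flag are illustrative) of
how the intercommunicator returned by accept/connect would typically be used
and released once the connection is established:

#include <mpi.h>
#include <cstdio>

// Illustrative only: assumes 'icomm' was obtained as in the program above and
// 'is_server' distinguishes the accepting side ("A") from the connecting side ("B").
void exchange_and_teardown(MPI_Comm icomm, bool is_server, const char *port)
{
  int value = 42;
  if (is_server) {
    // Ranks in an intercommunicator address the *remote* group, so rank 0
    // here means rank 0 of the other side.
    MPI_Send(&value, 1, MPI_INT, 0, 0, icomm);
  } else {
    MPI_Recv(&value, 1, MPI_INT, 0, 0, icomm, MPI_STATUS_IGNORE);
    printf("Received %d over the intercommunicator\n", value);
  }

  // Collective over both sides: tear the connection down cleanly.
  MPI_Comm_disconnect(&icomm);

  if (is_server) {
    // Only the side that published and opened the port cleans it up.
    MPI_Unpublish_name("foobar", MPI_INFO_NULL, port);
    MPI_Close_port(port);
  }
}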



Do you have any idea?

Best,
Florian
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Can't connect using MPI Ports

2017-11-09 Thread r...@open-mpi.org
I did a quick check across the v2.1 and v3.0 OMPI releases and both failed,
though with different signatures. It looks like a problem in the OMPI dynamics
integration (the PMIx library itself appeared to be doing the right things).

I’d suggest filing an issue on the OMPI github site so someone can address it 
(I don’t work much on OMPI any more, I’m afraid).



___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] usNIC BTL unrecognized payload type 255 when running under SLURM srun but not mpiexec/mpirun

2017-11-09 Thread Forai,Petar
Hi everyone!

We’re observing output such as the following when running non-trivial MPI
software through SLURM’s srun:

[cn-11:52778] unrecognized payload type 255
[cn-11:52778] base = 0x9ce2c0, proto = 0x9ce2c0, hdr = 0x9ce300
[cn-11:52778]0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-11:52778]   10: 00 00 00 00 00 00 06 02 ff 0c 1f c2 06 02 ff 0c
[cn-11:52778]   20: b9 8f 08 00 45 00 00 3c 00 00 40 00 08 11 5d 5d
[cn-11:52778]   30: 0a 95 00 16 0a 95 00 15 e5 05 e8 d9 00 28 7c 8c
[cn-11:52778]   40: 01 00 00 00 00 00 31 b6 00 00 8f e3 00 00 00 00
[cn-11:52778]   50: 00 00 00 00 00 00 06 02 ff 0c d3 25 06 02 ff 0c
[cn-11:52778]   60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-11:52778]   70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00


The message is independent of the software being run, but it is NOT observed
when launching with mpiexec/mpirun. When we switch to the TCP or vader BTL the
output is clean and the message does not appear. It is emitted by different
ranks on various nodes, so it is not reproducibly tied to particular nodes.

The message seems to originate from here [1].

Any idea how to get rid of this, or what the root cause might be? Hints on what
to check would be greatly appreciated!

TIA!

Petar


Environment:
1.4.0-cisco-1.0.531.1-RHEL7U3
SLURM 17.02.7
OpenMPI 2.0.2 configured with libfabric, usnic, SLURM, SLURM’s PMI library:

./configure --prefix=/software/171020/software/openmpi/2.0.2-gcc-6.3.0-2.27 
--enable-shared --enable-mpi-thread-multiple 
--with-libfabric=/opt/cisco/libfabric --without-memory-manager 
--enable-mpirun-prefix-by-default --enable-mpirun-prefix-by-default 
--with-hwloc=$EBROOTHWLOC --with-usnic --with-verbs-usnic --with-slurm 
--with-pmi=/cm/shared/apps/slurm/current --enable-dlopen  LDFLAGS="-Wl,-rpath 
-Wl,/opt/cisco/libfabric/lib -Wl,--enable-new-dtags"

NIC  UCSC-MLOM-C40Q-03 [VIC 1387]
VIC Firmware  4.1(3a)


[1] 
https://github.com/open-mpi/ompi/blob/9c3ae64297e034b30cb65298908014764216c616/opal/mca/btl/usnic/btl_usnic_recv.c#L354
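
For context, the linked receive path dispatches on a payload-type field carried
in the usNIC BTL header, and the warning fires when that field matches none of
the known types; the hex dump printed after it appears to be the surrounding
receive buffer (which here seems to contain an Ethernet/IPv4/UDP header). A
purely hypothetical sketch of that kind of check, with illustrative identifiers
rather than the actual OMPI code:

#include <cstdint>
#include <cstdio>

// Hypothetical payload types; the real values live in the usNIC BTL headers.
enum hypothetical_payload_type : std::uint8_t {
  PAYLOAD_FRAG  = 1,
  PAYLOAD_CHUNK = 2,
  PAYLOAD_ACK   = 3,
};

// Illustrative receive-side dispatch: a packet whose header byte carries an
// unknown type (e.g. 255) is not one the BTL understands, which usually points
// at a corrupted or foreign frame landing on the receive queue.
void handle_packet(const std::uint8_t *hdr)
{
  switch (hdr[0]) {
    case PAYLOAD_FRAG:  /* hand off to the fragment path */ break;
    case PAYLOAD_CHUNK: /* hand off to the chunk path    */ break;
    case PAYLOAD_ACK:   /* hand off to the ACK path      */ break;
    default:
      std::fprintf(stderr, "unrecognized payload type %u\n",
                   static_cast<unsigned>(hdr[0]));
      // The real code then dumps the buffer, producing the hex output shown above.
      break;
  }
}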



___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users