Re: [OMPI users] Can't connect using MPI Ports
>> The MPI Ports functionality (chapter 10.4 of MPI 3.1), mainly consisting of
>> MPI_Open_port, MPI_Comm_accept and MPI_Comm_connect, is not usable without
>> running an ompi-server as a third process?
>
> Yes, that's correct. The reason for moving in that direction is that the
> resource managers, as they continue to integrate PMIx into them, are going
> to be providing that third party. This will make connect/accept much easier
> to use, and a great deal more scalable.
>
> See https://github.com/pmix/RFCs/blob/master/RFC0003.md for an explanation.

Ok, thanks for that input. I haven't heard of PMIx so far (only as part of
some ompi error messages).

Using ompi-server -d -r 'ompi.connect' I was able to publish and retrieve the
port name; however, still no connection could be established.

% mpirun -n 1 --ompi-server "file:ompi.connect" ./a.out A
Published port 3044605953.0:664448538

% mpirun -n 1 --ompi-server "file:ompi.connect" ./a.out B
Looked up port 3044605953.0:664448538

At this point, both processes hang.

The code is:

#include <mpi.h>
#include <cstdio>
#include <string>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  std::string a(argv[1]);
  char p[MPI_MAX_PORT_NAME];
  MPI_Comm icomm;

  if (a == "A") {
    MPI_Open_port(MPI_INFO_NULL, p);
    MPI_Publish_name("foobar", MPI_INFO_NULL, p);
    printf("Published port %s\n", p);
    MPI_Comm_accept(p, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &icomm);
  }
  if (a == "B") {
    MPI_Lookup_name("foobar", MPI_INFO_NULL, p);
    printf("Looked up port %s\n", p);
    MPI_Comm_connect(p, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &icomm);
  }

  MPI_Finalize();

  return 0;
}

Do you have any idea?

Best,
Florian
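(For context, a minimal sketch of how the program above could verify the
intercommunicator once MPI_Comm_accept / MPI_Comm_connect return, and then
tear everything down. It assumes the same "A"/"B" roles as the code above;
the single-int ping, tag 0, and the helper name verify_and_teardown are
illustrative, not from the original post.)

#include <mpi.h>
#include <cstdio>

// icomm is the intercommunicator returned by MPI_Comm_accept (on "A")
// or MPI_Comm_connect (on "B"); port is the string from MPI_Open_port.
void verify_and_teardown(MPI_Comm icomm, bool is_server, char *port)
{
  int payload = 42;  // illustrative payload
  if (is_server) {
    // In an intercommunicator, ranks address the remote group, so this
    // sends to rank 0 of the connecting ("B") job.
    MPI_Send(&payload, 1, MPI_INT, 0, 0, icomm);
  } else {
    MPI_Recv(&payload, 1, MPI_INT, 0, 0, icomm, MPI_STATUS_IGNORE);
    printf("Received %d over the intercommunicator\n", payload);
  }

  MPI_Comm_disconnect(&icomm);   // collective over both groups
  if (is_server) {
    MPI_Unpublish_name("foobar", MPI_INFO_NULL, port);
    MPI_Close_port(port);
  }
}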
Re: [OMPI users] Can't connect using MPI Ports
I did a quick check across the v2.1 and v3.0 OMPI releases and both failed,
though with different signatures. It looks like a problem in the OMPI dynamics
integration (i.e., the PMIx library looked like it was doing the right
things). I'd suggest filing an issue on the OMPI GitHub site so someone can
address it (I don't work much on OMPI any more, I'm afraid).

> On Nov 9, 2017, at 1:54 AM, Florian Lindner wrote:
> [full message quoted above]
[OMPI users] usNIC BTL unrecognized payload type 255 when running under SLURM srun but not mpiexec/mpirun
Hi everyone!

We're observing output such as the following when running non-trivial MPI
software through SLURM's srun:

[cn-11:52778] unrecognized payload type 255
[cn-11:52778] base = 0x9ce2c0, proto = 0x9ce2c0, hdr = 0x9ce300
[cn-11:52778]  0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-11:52778] 10: 00 00 00 00 00 00 06 02 ff 0c 1f c2 06 02 ff 0c
[cn-11:52778] 20: b9 8f 08 00 45 00 00 3c 00 00 40 00 08 11 5d 5d
[cn-11:52778] 30: 0a 95 00 16 0a 95 00 15 e5 05 e8 d9 00 28 7c 8c
[cn-11:52778] 40: 01 00 00 00 00 00 31 b6 00 00 8f e3 00 00 00 00
[cn-11:52778] 50: 00 00 00 00 00 00 06 02 ff 0c d3 25 06 02 ff 0c
[cn-11:52778] 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-11:52778] 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

The message is independent of the software being run, BUT it is NOT observed
when launching with mpiexec/mpirun. When switching to the TCP or vader BTL the
output is clean and the message does not appear. It is emitted by different
ranks on various nodes, so it is not reproducibly tied to the same nodes. The
message seems to originate from here [1].

Any idea how to get rid of this, or what might be the root cause? Hints on
what to check would be greatly appreciated!

TIA!
Petar

Environment:
  1.4.0-cisco-1.0.531.1-RHEL7U3
  SLURM 17.02.7
  OpenMPI 2.0.2, configured with libfabric, usnic, SLURM, and SLURM's PMI library:

  ./configure --prefix=/software/171020/software/openmpi/2.0.2-gcc-6.3.0-2.27 \
      --enable-shared --enable-mpi-thread-multiple \
      --with-libfabric=/opt/cisco/libfabric --without-memory-manager \
      --enable-mpirun-prefix-by-default --enable-mpirun-prefix-by-default \
      --with-hwloc=$EBROOTHWLOC --with-usnic --with-verbs-usnic --with-slurm \
      --with-pmi=/cm/shared/apps/slurm/current --enable-dlopen \
      LDFLAGS="-Wl,-rpath -Wl,/opt/cisco/libfabric/lib -Wl,--enable-new-dtags"

  NIC: UCSC-MLOM-C40Q-03 [VIC 1387]
  VIC firmware: 4.1(3a)

[1] https://github.com/open-mpi/ompi/blob/9c3ae64297e034b30cb65298908014764216c616/opal/mca/btl/usnic/btl_usnic_recv.c#L354
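(A possible workaround sketch while the root cause is investigated, assuming
Open MPI's standard MCA parameter mechanism; the exact BTL list below is
illustrative. Excluding the usNIC BTL at launch time should match the clean
behavior observed with the TCP/vader BTLs, at the cost of not using usNIC.)

# Under srun, MCA parameters can be passed as environment variables:
export OMPI_MCA_btl=self,vader,tcp   # or: export OMPI_MCA_btl=^usnic
srun -n 4 ./a.out

# Equivalent when launching with mpirun:
mpirun -n 4 --mca btl self,vader,tcp ./a.out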