I did a little more testing in case it helps: if I run ompi-server on the same host as the one where I call MPI_Publish_name(), it connects successfully. But when I run it on a separate machine (on the same network and reachable via TCP), I hit the hang I described above.
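In case the snippet quoted below is missing context, here is the minimal server I'm testing with, padded out to a full program. This is a sketch from memory rather than a verbatim copy - the "adam-server" name matches my snippet, and the accept/teardown calls are just the standard MPI-2 dynamic-process boilerplate:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    /* Publish under global scope so processes in other jobs can see it. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "ompi_global_scope", "true");

    char myport[MPI_MAX_PORT_NAME];
    MPI_Open_port(MPI_INFO_NULL, myport);
    MPI_Publish_name("adam-server", info, myport); /* hangs here with a remote ompi-server */
    printf("Published port %s\n", myport);

    /* Wait for one client to connect, then tear everything down. */
    MPI_Comm client;
    MPI_Comm_accept(myport, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);

    MPI_Comm_disconnect(&client);
    MPI_Unpublish_name("adam-server", info, myport);
    MPI_Close_port(myport);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}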
Thanks for taking a look - if you'd like me to open a bug report for this one somewhere, just let me know.
-Adam

On Sun, Mar 19, 2017 at 2:46 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:

> Well, your initial usage looks correct - you don't launch ompi-server via
> mpirun. However, it sounds like there is probably a bug somewhere if it
> hangs as you describe.
>
> Scratching my head, I can only recall fewer than a handful of people ever
> using these MPI functions to cross-connect jobs, so it does tend to fall
> into disrepair. As I said, I'll try to repair it, at least for 3.0.
>
> On Mar 19, 2017, at 4:37 AM, Adam Sylvester <op8...@gmail.com> wrote:
>
> I am trying to use ompi-server with Open MPI 1.10.6. I'm wondering whether
> I should run it with or without the mpirun command. If I run this:
>
> ompi-server --no-daemonize -r +
>
> it prints something like 959315968.0;tcp://172.31.3.57:45743 to stdout,
> but I have so far been unable to connect to it. That is, in another
> application on another machine on the same network as the ompi-server
> machine, I try:
>
> MPI_Info info;
> MPI_Info_create(&info);
> MPI_Info_set(info, "ompi_global_scope", "true");
>
> char myport[MPI_MAX_PORT_NAME];
> MPI_Open_port(MPI_INFO_NULL, myport);
> MPI_Publish_name("adam-server", info, myport);
>
> but the MPI_Publish_name() call hangs forever when I run it like this:
>
> mpirun -np 1 --ompi-server "959315968.0;tcp://172.31.3.57:45743" server
>
> Blog posts are inconsistent as to whether you should run ompi-server under
> mpirun, so I tried that too, but it seg faults:
>
> mpirun -np 1 ompi-server --no-daemonize -r +
> [ip-172-31-5-39:14785] *** Process received signal ***
> [ip-172-31-5-39:14785] Signal: Segmentation fault (11)
> [ip-172-31-5-39:14785] Signal code: Address not mapped (1)
> [ip-172-31-5-39:14785] Failing at address: 0x6e0
> [ip-172-31-5-39:14785] [ 0] /lib64/libpthread.so.0(+0xf370)[0x7f895d7a5370]
> [ip-172-31-5-39:14785] [ 1] /usr/local/lib/libopen-pal.so.13(opal_hwloc191_hwloc_get_cpubind+0x9)[0x7f895e336839]
> [ip-172-31-5-39:14785] [ 2] /usr/local/lib/libopen-rte.so.12(orte_ess_base_proc_binding+0x17a)[0x7f895e5d8fca]
> [ip-172-31-5-39:14785] [ 3] /usr/local/lib/openmpi/mca_ess_env.so(+0x15dd)[0x7f895cdcd5dd]
> [ip-172-31-5-39:14785] [ 4] /usr/local/lib/libopen-rte.so.12(orte_init+0x168)[0x7f895e5b5368]
> [ip-172-31-5-39:14785] [ 5] ompi-server[0x4014d4]
> [ip-172-31-5-39:14785] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f895d3f6b35]
> [ip-172-31-5-39:14785] [ 7] ompi-server[0x40176b]
> [ip-172-31-5-39:14785] *** End of error message ***
>
> Am I doing something wrong?
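P.S. For completeness, the client side of this is just the standard MPI-2 lookup/connect counterpart. The sketch below isn't lifted verbatim from my code, but it shows the pattern, with the same "adam-server" name and global scope as the server:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    /* Look up the port the server published. The scope must match
       what the server used when publishing. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "ompi_global_scope", "true");

    char port[MPI_MAX_PORT_NAME];
    MPI_Lookup_name("adam-server", info, port);

    /* Connect to the server through the looked-up port. */
    MPI_Comm server;
    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);
    printf("Connected via port %s\n", port);

    MPI_Comm_disconnect(&server);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}

Both sides get launched with mpirun and the same --ompi-server URI, so the publish and the lookup go through the same name server.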
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users