Can you send the full verbose output with "--mca btl_base_verbose 100"?
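
In the meantime, here is a rough sketch that may help when reading that output. It takes the interface list from the environment description quoted below and reports which subnets each pair of hosts is directly attached to. The host table is copied from that mail. The function names `networks` and `shared_networks` are made up for illustration, and the script only compares directly attached subnets. It does not look at routing tables or at what Open MPI itself selects.

```python
import ipaddress
from itertools import combinations

# Interface addresses copied from the environment description quoted below.
HOSTS = {
    "c01": ["10.0.0.1/24", "172.16.0.1/24", "172.21.1.136/24"],
    "c02": ["10.0.0.2/24"],
    "c03": ["192.168.0.1/24", "172.16.0.2/24"],
    "c04": ["192.168.0.2/24"],
}


def networks(addrs):
    """Return the set of IP networks that a list of CIDR addresses belongs to."""
    return {ipaddress.ip_interface(a).network for a in addrs}


def shared_networks(hosts):
    """Map each host pair to the sorted subnets both hosts are attached to."""
    result = {}
    for a, b in combinations(sorted(hosts), 2):
        common = networks(hosts[a]) & networks(hosts[b])
        result[(a, b)] = sorted(str(n) for n in common)
    return result


if __name__ == "__main__":
    for (a, b), nets in shared_networks(HOSTS).items():
        label = ", ".join(nets) if nets else "no shared subnet"
        print(f"{a} <-> {b}: {label}")
```

For c01 and c03 this reports only 172.16.0.0/24, which would be consistent with c03 being unable to reach 10.0.0.1. The BTL verbose output is still needed to see which addresses are actually being advertised.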

> On Jul 4, 2018, at 4:36 PM, carlos aguni <aguni...@gmail.com> wrote:
>
> Hi Gilles.
>
> Thank you for your reply! :)
> I'm now using a compiled version of OpenMPI 3.0.2 and all seems to work fine now.
> Running `mpirun -n 3 -host c01,c02,c03 hostname` i get:
> c01
> c02
> c03
>
> `mpirun -n 2 -host c01,c02 hostname`:
> c02
> c01
>
> `mpirun -n 2 -host c01,c03 hostname`:
> c01
> c03
>
> Which is expected.
>
> Now when I run a MPI_Spawn it prints out a warning message which refers to it getting the wrong IP.
> Check the command. I'll highlight some verbose.
> `mpirun -n 1 --machinefile con_c03_hostfile --mca oob_base_verbose 10 con_c03`:
> Hello world from processor c01, rank 0 out of 2 processors
> Im the spawned rank 0
> Hello world from processor c03, rank 1 out of 2 processors
> [[35996,2],0][btl_tcp_endpoint.c:755:mca_btl_tcp_endpoint_start_connect] from c03 to: c01 Unable to connect to the peer 10.0.0.1 on port 1024: Network is unreachable
>
> [c03:06355] pml_ob1_sendreq.c:235 FATAL
>
> Verbose below:
> [c01:05462] [[36010,0],0] oob:tcp:init adding 10.0.0.1 to our list of V4 connections
> [c01:05462] [[36010,0],0] oob:tcp:init adding 172.16.0.1 to our list of V4 connections
> [c01:05462] [[36010,0],0] oob:tcp:init adding 172.21.1.136 to our list of V4 connections
> [c03:06225] [[36010,0],1] oob:tcp:init adding 192.168.0.1 to our list of V4 connections
> [c03:06225] [[36010,0],1] oob:tcp:init adding 172.16.0.2 to our list of V4 connections
>
> Is there a way to suppress it?
>
> My env is as described below:
> c01
> ens8 10.0.0.1/24
> ens9 172.16.0.1/24
> eth0 172.21.1.136/24
>
> c02
> eth0 10.0.0.2/24
>
> c03
> ens8 192.168.0.1/24
> eth1 172.16.0.2/24
>
> c04
> eth0 192.168.0.2/24
>
> Regards,
> Carlos.
>
> On Sun, Jul 1, 2018 at 9:01 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> Carlos,
>
> Open MPI 3.0.2 has been released, and it contains several bug fixes, so I do encourage you to upgrade and try again.
>
> if it still does not work, can you please run
>
> mpirun --mca oob_base_verbose 10 ...
>
> and then compress and post the output ?
>
> out of curiosity, would
>
> mpirun --mca routed_radix 1 ...
>
> work in your environment ?
>
> once we can analyze the logs, we should be able to figure out what is going wrong.
>
> Cheers,
>
> Gilles
>
> On 6/29/2018 4:10 AM, carlos aguni wrote:
> Just realized my email wasn't sent to the archive.
>
> On Sat, Jun 23, 2018 at 5:34 PM, carlos aguni <aguni...@gmail.com> wrote:
>
> Hi!
>
> Thank you all for your reply Jeff, Gilles and rhc.
>
> Thank you Jeff and rhc for clarifying to me some of the openmpi's internals.
>
> >> FWIW: we never send interface names to other hosts - just dot addresses
>
> Should have clarified - when you specify an interface name for the MCA param, then it is the interface name that is transferred as that is the value of the MCA param. However, once we determine our address, we only transfer dot addresses between ourselves
>
> If only dot addresses are sent to the hosts then why doesn't openmpi use the default route like `ip route get <other host IP>` instead of choosing a random one? Is it an expected behaviour? Can it be changed?
>
> Sorry. As Gilles pointed out I forgot to mention which openmpi version I was using. I'm using openmpi 3.0.0 gcc 7.3.0 from openhpc. Centos 7.5.
>
> > mpirun --mca oob_tcp_if_exclude 192.168.100.0/24 ...
>
> I cannot just exclude that interface cause after that I want to add another computer that's on a different network. And this is where things get messy :( I cannot just include and exclude networks cause I have different machines on different networks.
> This is what I want to achieve:
>
>        compute01           compute02        compute03
> ens3   192.168.100.104/24  10.0.0.227/24    192.168.100.105/24
> ens8   10.0.0.228/24       172.21.1.128/24  ---
> ens9   172.21.1.155/24     ---              ---
>
> So I'm in compute01 MPI_spawning another process on compute02 and compute03.
> With both MPI_Spawn and `mpirun -n 3 -host compute01,compute02,compute03 hostname`
>
> Then when I include the mca parameters I get this:
> `mpirun --oversubscribe --allow-run-as-root -n 3 --mca oob_tcp_if_include 10.0.0.0/24,192.168.100.0/24 -host compute01,compute02,compute03 hostname`
> WARNING: An invalid value was given for oob_tcp_if_include. This value will be ignored.
> ...
> Message: Did not find interface matching this subnet
>
> This would all work if it were to use the system's internals like `ip route`.
>
> Best regards,
> Carlos.
>
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

--
Jeff Squyres
jsquy...@cisco.com