Can you send the full verbose output with "--mca btl_base_verbose 100"?
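
As a rough cross-check while waiting for that output, here is a minimal sketch (not anything Open MPI itself runs) that takes the interface addresses from the environment listed in the quoted message below and reports which subnets two hosts have in common. The `HOSTS` table and the helper names are made up for illustration, and "directly reachable" here only means "on a locally attached subnet"; any gateways or static routes are ignored.

```python
import ipaddress

# Interface addresses copied from the environment in the quoted message.
# The dictionary layout and helper names are illustrative assumptions.
HOSTS = {
    "c01": ["10.0.0.1/24", "172.16.0.1/24", "172.21.1.136/24"],
    "c02": ["10.0.0.2/24"],
    "c03": ["192.168.0.1/24", "172.16.0.2/24"],
    "c04": ["192.168.0.2/24"],
}


def subnets(host):
    """Return the set of networks the host has an interface on."""
    return {ipaddress.ip_interface(addr).network for addr in HOSTS[host]}


def shared_subnets(host_a, host_b):
    """Return the networks both hosts are attached to, sorted."""
    return sorted(subnets(host_a) & subnets(host_b))


def directly_reachable(src_host, dst_ip):
    """True if dst_ip lies on a subnet attached to src_host (routing ignored)."""
    addr = ipaddress.ip_address(dst_ip)
    return any(addr in net for net in subnets(src_host))


if __name__ == "__main__":
    print("c01/c03 share:", [str(n) for n in shared_subnets("c01", "c03")])
    print("c03 -> 10.0.0.1 on a local subnet:", directly_reachable("c03", "10.0.0.1"))
```

Under those assumptions, c01 and c03 have only 172.16.0.0/24 in common, and 10.0.0.1 is not on any subnet attached to c03, which would be consistent with the "Network is unreachable" message in the quoted log.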


> On Jul 4, 2018, at 4:36 PM, carlos aguni <aguni...@gmail.com> wrote:
> 
> Hi Gilles. 
> 
> Thank you for your reply! :)
> I'm now using a compiled version of OpenMPI 3.0.2 and all seems to work fine now.
> Running `mpirun -n 3 -host c01,c02,c03 hostname` I get:
> c01
> c02
> c03
> 
> `mpirun -n 2 -host c01,c02 hostname`:
> c02
> c01
> 
> `mpirun -n 2 -host c01,c03 hostname`:
> c01
> c03
> 
> Which is expected.
> 
> Now when I run a program that uses MPI_Spawn, it prints a warning message
> which suggests it is getting the wrong IP.
> Here is the command, followed by some of the verbose output:
> `mpirun -n 1 --machinefile con_c03_hostfile --mca oob_base_verbose 10 con_c03`:
> Hello world from processor c01, rank 0 out of 2 processors
> Im the spawned rank 0
> Hello world from processor c03, rank 1 out of 2 processors
> [[35996,2],0][btl_tcp_endpoint.c:755:mca_btl_tcp_endpoint_start_connect] from c03 to: c01 Unable to connect to the peer 10.0.0.1 on port 1024: Network is unreachable
> 
> [c03:06355] pml_ob1_sendreq.c:235 FATAL
> 
> Verbose below:
> [c01:05462] [[36010,0],0] oob:tcp:init adding 10.0.0.1 to our list of V4 connections
> [c01:05462] [[36010,0],0] oob:tcp:init adding 172.16.0.1 to our list of V4 connections
> [c01:05462] [[36010,0],0] oob:tcp:init adding 172.21.1.136 to our list of V4 connections
> [c03:06225] [[36010,0],1] oob:tcp:init adding 192.168.0.1 to our list of V4 connections
> [c03:06225] [[36010,0],1] oob:tcp:init adding 172.16.0.2 to our list of V4 connections
> 
> Is there a way to suppress it?
> 
> My env is as described below:
> c01
> ens8 10.0.0.1/24
> ens9 172.16.0.1/24
> eth0 172.21.1.136/24
> 
> c02
> eth0 10.0.0.2/24
> 
> c03
> ens8 192.168.0.1/24
> eth1 172.16.0.2/24
> 
> c04
> eth0 192.168.0.2/24
> 
> Regards,
> Carlos.
> 
> On Sun, Jul 1, 2018 at 9:01 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> Carlos,
> 
> 
> Open MPI 3.0.2 has been released, and it contains several bug fixes, so I do
> encourage you to upgrade and try again.
> 
> If it still does not work, can you please run
> 
> mpirun --mca oob_base_verbose 10 ...
> 
> and then compress and post the output?
> 
> Out of curiosity, would
> 
> mpirun --mca routed_radix 1 ...
> 
> work in your environment?
> 
> Once we can analyze the logs, we should be able to figure out what is going wrong.
> 
> 
> Cheers,
> 
> Gilles
> 
> On 6/29/2018 4:10 AM, carlos aguni wrote:
> Just realized my email wasn't sent to the archive.
> 
> On Sat, Jun 23, 2018 at 5:34 PM, carlos aguni <aguni...@gmail.com> wrote:
> 
>     Hi!
> 
>     Thank you all for your replies, Jeff, Gilles and rhc.
> 
>     Thank you Jeff and rhc for clarifying some of Open MPI's
>     internals for me.
> 
>     >> FWIW: we never send interface names to other hosts - just dot
>     >> addresses
>     > Should have clarified - when you specify an interface name for the
>     > MCA param, then it is the interface name that is transferred as
>     > that is the value of the MCA param. However, once we determine our
>     > address, we only transfer dot addresses between ourselves
> 
>     If only dot addresses are sent to the hosts, then why doesn't
>     Open MPI use the default route, as in `ip route get <other host IP>`,
>     instead of choosing a random one? Is this expected behaviour? Can
>     it be changed?
> 
>     Sorry. As Gilles pointed out I forgot to mention which openmpi
>     version I was using. I'm using openmpi 3.0.0 gcc 7.3.0 from
>     openhpc. Centos 7.5.
> 
>     > mpirun --mca oob_tcp_if_exclude 192.168.100.0/24 ...
> 
>     I cannot just exclude that interface because after that I want to
>     add another computer that's on a different network. And this is
>     where things get messy :( I cannot just include and exclude
>     networks because I have different machines on different networks.
>     This is what I want to achieve:
> 
> 
>             compute01            compute02         compute03
>     ens3    192.168.100.104/24   10.0.0.227/24     192.168.100.105/24
>     ens8    10.0.0.228/24        172.21.1.128/24   ---
>     ens9    172.21.1.155/24      ---               ---
> 
>     So I'm in compute01 MPI_spawning another process on compute02 and
>     compute03.
>     With both MPI_Spawn and `mpirun -n 3 -host
>     compute01,compute02,compute03 hostname`
> 
>     Then when I include the mca parameters I get this:
>     `mpirun --oversubscribe --allow-run-as-root -n 3 --mca
>     oob_tcp_if_include 10.0.0.0/24,192.168.100.0/24 -host
>     compute01,compute02,compute03 hostname`
>     WARNING: An invalid value was given for oob_tcp_if_include. This
>     value will be ignored.
>     ...
>     Message:    Did not find interface matching this subnet
> 
>     This would all work if it were to use the system's internals like
>     `ip route`.
> 
>     Best regards,
>     Carlos.
> 
> 
> 
> 
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users


-- 
Jeff Squyres
jsquy...@cisco.com
