Hello Howard,

Thank you for looking into this. Attached is the output you requested. Also, I am using Open MPI 2.0.1.

Thank you,
Brendan

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Howard Pritchard
Sent: Friday, January 20, 2017 6:35 PM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Open MPI over RoCE using breakout cable and switch

Hi Brendan,

I doubt this kind of config has gotten any testing with OMPI. Could you rerun with --mca btl_base_verbose 100 added to the command line and post the output to the list?

Howard

Brendan Myers <brendan.my...@soft-forge.com> wrote on Fri, Jan 20, 2017 at 15:04:

Hello,

I am attempting to get Open MPI to run over 2 nodes using a switch and a single breakout cable with this design:

(100GbE)QSFP <----> 2x (50GbE)QSFP

Hardware Layout:
Breakout cable module A connects to switch (100GbE)
Breakout cable module B1 connects to node 1 RoCE NIC (50GbE)
Breakout cable module B2 connects to node 2 RoCE NIC (50GbE)
Switch is Mellanox SN 2700 100GbE RoCE switch

* I am able to pass RDMA traffic between the nodes with perftest (ib_write_bw) when using the breakout cable as the interconnect (IC) from both nodes to the switch.
* When attempting to run a job using the breakout cable as the IC, Open MPI aborts with "error initializing an OpenFabrics device" errors.
* If I replace the breakout cable with 2 standard QSFP cables, the Open MPI job completes correctly.

This is the command I use. It works unless I attempt a run with the breakout cable used as the IC:

mpirun --mca btl openib,self,sm --mca btl_openib_receive_queues P,65536,120,64,32 --mca btl_openib_cpc_include rdmacm -hostfile mpi-hosts-ce /usr/local/bin/IMB-MPI1

If anyone has any idea as to why using a breakout cable is causing my jobs to fail, please let me know.

Thank you,
Brendan T. W. Myers
brendan.my...@soft-forge.com
Software Forge Inc
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
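For readers following the thread, Howard's request amounts to appending one MCA option to the original command line. The sketch below only assembles and prints that command; it does not run it. The variable names and the log file name btl-verbose.log in the comment are illustrative assumptions, not something either poster used.

```shell
#!/bin/sh
# Options from Brendan's original command line (copied from the report).
BASE_ARGS="--mca btl openib,self,sm --mca btl_openib_receive_queues P,65536,120,64,32 --mca btl_openib_cpc_include rdmacm -hostfile mpi-hosts-ce"

# Extra option Howard asked for: verbose output from the BTL framework.
VERBOSE_ARGS="--mca btl_base_verbose 100"

# Full command line with the verbose option added before the benchmark binary.
CMD="mpirun $BASE_ARGS $VERBOSE_ARGS /usr/local/bin/IMB-MPI1"

# Print the command instead of running it, so it can be reviewed first.
echo "$CMD"

# To run it and keep a copy of the output (hypothetical file name):
#   $CMD 2>&1 | tee btl-verbose.log
```

The attached output was captured with script(1), as the "Script done" line at the end of the log shows; piping through tee is just one alternative way to keep a copy.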
[sm-node-8:13428] mca: base: components_register: registering framework btl components [sm-node-8:13428] mca: base: components_register: found loaded component self [sm-node-8:13428] mca: base: components_register: component self register function successful [sm-node-8:13428] mca: base: components_register: found loaded component openib [sm-node-7:28343] mca: base: components_register: registering framework btl components [sm-node-7:28343] mca: base: components_register: found loaded component self [sm-node-7:28343] mca: base: components_register: component self register function successful [sm-node-7:28343] mca: base: components_register: found loaded component openib [sm-node-8:13428] mca: base: components_register: component openib register function successful [sm-node-8:13428] mca: base: components_register: found loaded component sm [sm-node-8:13428] mca: base: components_register: component sm register function successful [sm-node-8:13428] mca: base: components_open: opening btl components [sm-node-8:13428] mca: base: components_open: found loaded component self [sm-node-8:13428] mca: base: components_open: component self open function successful [sm-node-8:13428] mca: base: components_open: found loaded component openib [sm-node-8:13428] mca: base: components_open: component openib open function successful [sm-node-8:13428] mca: base: components_open: found loaded component sm [sm-node-8:13428] mca: base: components_open: component sm open function successful [sm-node-8:13428] select: initializing btl component self [sm-node-7:28342] mca: base: components_register: registering framework btl components [sm-node-7:28342] mca: base: components_register: found loaded component self [sm-node-8:13429] mca: base: components_register: registering framework btl components [sm-node-8:13429] mca: base: components_register: found loaded component self [sm-node-8:13428] select: init of component self returned success [sm-node-8:13428] select: initializing btl component 
openib [sm-node-7:28343] mca: base: components_register: component openib register function successful [sm-node-7:28343] mca: base: components_register: found loaded component sm [sm-node-8:13429] mca: base: components_register: component self register function successful [sm-node-7:28342] mca: base: components_register: component self register function successful [sm-node-7:28342] mca: base: components_register: found loaded component openib [sm-node-8:13429] mca: base: components_register: found loaded component openib [sm-node-7:28343] mca: base: components_register: component sm register function successful [sm-node-7:28343] mca: base: components_open: opening btl components [sm-node-7:28343] mca: base: components_open: found loaded component self [sm-node-8:13430] mca: base: components_register: registering framework btl components [sm-node-8:13430] mca: base: components_register: found loaded component self [sm-node-7:28343] mca: base: components_open: component self open function successful [sm-node-7:28343] mca: base: components_open: found loaded component openib [sm-node-7:28343] mca: base: components_open: component openib open function successful [sm-node-7:28343] mca: base: components_open: found loaded component sm [sm-node-7:28343] mca: base: components_open: component sm open function successful [sm-node-7:28343] select: initializing btl component self [sm-node-7:28343] select: init of component self returned success [sm-node-7:28343] select: initializing btl component openib [sm-node-8:13427] mca: base: components_register: registering framework btl components [sm-node-8:13427] mca: base: components_register: found loaded component self [sm-node-8:13430] mca: base: components_register: component self register function successful [sm-node-8:13430] mca: base: components_register: found loaded component openib [sm-node-8:13428] Checking distance from this process to device=mlx5_1 [sm-node-8:13428] Process is bound: distance to device is 0.000000 
[sm-node-8:13428] Checking distance from this process to device=mlx5_0 [sm-node-8:13428] Process is bound: distance to device is 0.000000 [sm-node-8:13427] mca: base: components_register: component self register function successful [sm-node-7:28343] Checking distance from this process to device=mlx5_1 [sm-node-7:28343] Process is bound: distance to device is 0.000000 [sm-node-8:13427] mca: base: components_register: found loaded component openib [sm-node-7:28343] Checking distance from this process to device=mlx5_0 [sm-node-7:28343] Process is bound: distance to device is 0.000000 [sm-node-7:28342] mca: base: components_register: component openib register function successful [sm-node-7:28342] mca: base: components_register: found loaded component sm [sm-node-8:13429] mca: base: components_register: component openib register function successful [sm-node-7:28342] mca: base: components_register: component sm register function successful [sm-node-8:13430] mca: base: components_register: component openib register function successful [sm-node-7:28342] mca: base: components_open: opening btl components [sm-node-7:28342] mca: base: components_open: found loaded component self [sm-node-7:28342] mca: base: components_open: component self open function successful [sm-node-7:28342] mca: base: components_open: found loaded component openib [sm-node-7:28342] mca: base: components_open: component openib open function successful [sm-node-7:28342] mca: base: components_open: found loaded component sm [sm-node-7:28342] mca: base: components_open: component sm open function successful [sm-node-7:28342] select: initializing btl component self [sm-node-7:28342] select: init of component self returned success [sm-node-7:28342] select: initializing btl component openib [sm-node-8:13429] mca: base: components_register: found loaded component sm [sm-node-8:13430] mca: base: components_register: found loaded component sm [sm-node-7:28344] mca: base: components_register: registering 
framework btl components [sm-node-7:28344] mca: base: components_register: found loaded component self [sm-node-8:13429] mca: base: components_register: component sm register function successful [sm-node-7:28345] mca: base: components_register: registering framework btl components [sm-node-7:28345] mca: base: components_register: found loaded component self [sm-node-8:13430] mca: base: components_register: component sm register function successful [sm-node-7:28344] mca: base: components_register: component self register function successful [sm-node-8:13429] mca: base: components_open: opening btl components [sm-node-8:13429] mca: base: components_open: found loaded component self [sm-node-8:13429] mca: base: components_open: component self open function successful [sm-node-8:13429] mca: base: components_open: found loaded component openib [sm-node-8:13429] mca: base: components_open: component openib open function successful [sm-node-8:13429] mca: base: components_open: found loaded component sm [sm-node-8:13429] mca: base: components_open: component sm open function successful [sm-node-8:13429] select: initializing btl component self [sm-node-8:13429] select: init of component self returned success [sm-node-8:13429] select: initializing btl component openib [sm-node-7:28345] mca: base: components_register: component self register function successful [sm-node-8:13430] mca: base: components_open: opening btl components [sm-node-8:13430] mca: base: components_open: found loaded component self [sm-node-8:13430] mca: base: components_open: component self open function successful [sm-node-8:13430] mca: base: components_open: found loaded component openib [sm-node-8:13430] mca: base: components_open: component openib open function successful [sm-node-8:13430] mca: base: components_open: found loaded component sm [sm-node-8:13430] mca: base: components_open: component sm open function successful [sm-node-8:13430] select: initializing btl component self [sm-node-8:13430] 
select: init of component self returned success [sm-node-8:13430] select: initializing btl component openib [sm-node-7:28344] mca: base: components_register: found loaded component openib [sm-node-8:13427] mca: base: components_register: component openib register function successful [sm-node-7:28345] mca: base: components_register: found loaded component openib [sm-node-8:13427] mca: base: components_register: found loaded component sm [sm-node-7:28342] Checking distance from this process to device=mlx5_1 [sm-node-7:28342] Process is bound: distance to device is 0.000000 [sm-node-7:28342] Checking distance from this process to device=mlx5_0 [sm-node-7:28342] Process is bound: distance to device is 0.000000 [sm-node-8:13427] mca: base: components_register: component sm register function successful [sm-node-7:28346] mca: base: components_register: registering framework btl components [sm-node-7:28346] mca: base: components_register: found loaded component self [sm-node-8:13427] mca: base: components_open: opening btl components [sm-node-8:13427] mca: base: components_open: found loaded component self [sm-node-8:13427] mca: base: components_open: component self open function successful [sm-node-8:13427] mca: base: components_open: found loaded component openib [sm-node-8:13427] mca: base: components_open: component openib open function successful [sm-node-8:13427] mca: base: components_open: found loaded component sm [sm-node-8:13427] mca: base: components_open: component sm open function successful [sm-node-8:13427] select: initializing btl component self [sm-node-8:13427] select: init of component self returned success [sm-node-8:13427] select: initializing btl component openib [sm-node-7:28346] mca: base: components_register: component self register function successful [sm-node-8:13430] Checking distance from this process to device=mlx5_1 [sm-node-8:13430] Process is bound: distance to device is 0.000000 [sm-node-8:13430] Checking distance from this process to 
device=mlx5_0 [sm-node-8:13430] Process is bound: distance to device is 0.000000 [sm-node-7:28346] mca: base: components_register: found loaded component openib [sm-node-8:13429] Checking distance from this process to device=mlx5_1 [sm-node-8:13429] Process is bound: distance to device is 0.000000 [sm-node-8:13429] Checking distance from this process to device=mlx5_0 [sm-node-8:13429] Process is bound: distance to device is 0.000000 [sm-node-7:28345] mca: base: components_register: component openib register function successful [sm-node-8:13427] Checking distance from this process to device=mlx5_1 [sm-node-8:13427] Process is bound: distance to device is 0.000000 [sm-node-8:13427] Checking distance from this process to device=mlx5_0 [sm-node-8:13427] Process is bound: distance to device is 0.000000 [sm-node-7:28344] mca: base: components_register: component openib register function successful [sm-node-7:28345] mca: base: components_register: found loaded component sm [sm-node-7:28344] mca: base: components_register: found loaded component sm [sm-node-7:28346] mca: base: components_register: component openib register function successful [sm-node-7:28346] mca: base: components_register: found loaded component sm [sm-node-7:28346] mca: base: components_register: component sm register function successful [sm-node-7:28344] mca: base: components_register: component sm register function successful [sm-node-7:28345] mca: base: components_register: component sm register function successful [sm-node-7:28346] mca: base: components_open: opening btl components [sm-node-7:28346] mca: base: components_open: found loaded component self [sm-node-7:28346] mca: base: components_open: component self open function successful [sm-node-7:28346] mca: base: components_open: found loaded component openib [sm-node-7:28346] mca: base: components_open: component openib open function successful [sm-node-7:28346] mca: base: components_open: found loaded component sm [sm-node-7:28346] mca: base: 
components_open: component sm open function successful [sm-node-7:28346] select: initializing btl component self [sm-node-7:28346] select: init of component self returned success [sm-node-7:28346] select: initializing btl component openib [sm-node-7:28345] mca: base: components_open: opening btl components [sm-node-7:28345] mca: base: components_open: found loaded component self [sm-node-7:28345] mca: base: components_open: component self open function successful [sm-node-7:28345] mca: base: components_open: found loaded component openib [sm-node-7:28345] mca: base: components_open: component openib open function successful [sm-node-7:28345] mca: base: components_open: found loaded component sm [sm-node-7:28345] mca: base: components_open: component sm open function successful [sm-node-7:28345] select: initializing btl component self [sm-node-7:28345] select: init of component self returned success [sm-node-7:28345] select: initializing btl component openib [sm-node-7:28344] mca: base: components_open: opening btl components [sm-node-7:28344] mca: base: components_open: found loaded component self [sm-node-7:28344] mca: base: components_open: component self open function successful [sm-node-7:28344] mca: base: components_open: found loaded component openib [sm-node-7:28344] mca: base: components_open: component openib open function successful [sm-node-7:28344] mca: base: components_open: found loaded component sm [sm-node-7:28344] mca: base: components_open: component sm open function successful [sm-node-7:28344] select: initializing btl component self [sm-node-7:28344] select: init of component self returned success [sm-node-7:28344] select: initializing btl component openib [sm-node-7:28346] Checking distance from this process to device=mlx5_1 [sm-node-7:28346] Process is bound: distance to device is 0.000000 [sm-node-7:28346] Checking distance from this process to device=mlx5_0 [sm-node-7:28346] Process is bound: distance to device is 0.000000 [sm-node-7:28344] 
Checking distance from this process to device=mlx5_1 [sm-node-7:28345] Checking distance from this process to device=mlx5_1 [sm-node-7:28345] Process is bound: distance to device is 0.000000 [sm-node-7:28345] Checking distance from this process to device=mlx5_0 [sm-node-7:28344] Process is bound: distance to device is 0.000000 [sm-node-7:28344] Checking distance from this process to device=mlx5_0 [sm-node-7:28344] Process is bound: distance to device is 0.000000 [sm-node-7:28345] Process is bound: distance to device is 0.000000 [sm-node-8:13431] mca: base: components_register: registering framework btl components [sm-node-8:13431] mca: base: components_register: found loaded component self [sm-node-8:13431] mca: base: components_register: component self register function successful [sm-node-8:13431] mca: base: components_register: found loaded component openib [sm-node-8:13431] mca: base: components_register: component openib register function successful [sm-node-8:13431] mca: base: components_register: found loaded component sm [sm-node-8:13431] mca: base: components_register: component sm register function successful [sm-node-8:13431] mca: base: components_open: opening btl components [sm-node-8:13431] mca: base: components_open: found loaded component self [sm-node-8:13431] mca: base: components_open: component self open function successful [sm-node-8:13431] mca: base: components_open: found loaded component openib [sm-node-8:13431] mca: base: components_open: component openib open function successful [sm-node-8:13431] mca: base: components_open: found loaded component sm [sm-node-8:13431] mca: base: components_open: component sm open function successful [sm-node-8:13431] select: initializing btl component self [sm-node-8:13431] select: init of component self returned success [sm-node-8:13431] select: initializing btl component openib [sm-node-8:13431] Checking distance from this process to device=mlx5_1 [sm-node-8:13431] Process is bound: distance to device is 
0.000000 [sm-node-8:13431] Checking distance from this process to device=mlx5_0 [sm-node-8:13431] Process is bound: distance to device is 0.000000 [sm-node-7:28348] mca: base: components_register: registering framework btl components [sm-node-7:28348] mca: base: components_register: found loaded component self [sm-node-7:28348] mca: base: components_register: component self register function successful [sm-node-7:28348] mca: base: components_register: found loaded component openib [sm-node-8:13433] mca: base: components_register: registering framework btl components [sm-node-8:13433] mca: base: components_register: found loaded component self [sm-node-8:13433] mca: base: components_register: component self register function successful [sm-node-8:13433] mca: base: components_register: found loaded component openib [sm-node-7:28348] mca: base: components_register: component openib register function successful [sm-node-7:28348] mca: base: components_register: found loaded component sm [sm-node-7:28348] mca: base: components_register: component sm register function successful [sm-node-7:28348] mca: base: components_open: opening btl components [sm-node-7:28348] mca: base: components_open: found loaded component self [sm-node-7:28348] mca: base: components_open: component self open function successful [sm-node-7:28348] mca: base: components_open: found loaded component openib [sm-node-7:28348] mca: base: components_open: component openib open function successful [sm-node-7:28348] mca: base: components_open: found loaded component sm [sm-node-7:28348] mca: base: components_open: component sm open function successful [sm-node-7:28348] select: initializing btl component self [sm-node-7:28348] select: init of component self returned success [sm-node-7:28348] select: initializing btl component openib [sm-node-8:13433] mca: base: components_register: component openib register function successful [sm-node-8:13433] mca: base: components_register: found loaded component sm 
[sm-node-7:28348] Checking distance from this process to device=mlx5_1
[sm-node-8:13433] mca: base: components_register: component sm register function successful
[sm-node-7:28348] Process is bound: distance to device is 0.000000
[sm-node-7:28348] Checking distance from this process to device=mlx5_0
[sm-node-7:28348] Process is bound: distance to device is 0.000000
[sm-node-8:13433] mca: base: components_open: opening btl components
[sm-node-8:13433] mca: base: components_open: found loaded component self
[sm-node-8:13433] mca: base: components_open: component self open function successful
[sm-node-8:13433] mca: base: components_open: found loaded component openib
[sm-node-8:13433] mca: base: components_open: component openib open function successful
[sm-node-8:13433] mca: base: components_open: found loaded component sm
[sm-node-8:13433] mca: base: components_open: component sm open function successful
[sm-node-8:13433] select: initializing btl component self
[sm-node-8:13433] select: init of component self returned success
[sm-node-8:13433] select: initializing btl component openib
[sm-node-8:13433] Checking distance from this process to device=mlx5_1
[sm-node-8:13433] Process is bound: distance to device is 0.000000
[sm-node-8:13433] Checking distance from this process to device=mlx5_0
[sm-node-8:13433] Process is bound: distance to device is 0.000000
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host: sm-node-7
  Local device: mlx5_0
--------------------------------------------------------------------------
[sm-node-7:28345] select: init of component openib returned failure
[sm-node-7:28346] select: init of component openib returned failure
[sm-node-7:28346] mca: base: close: component openib closed
[sm-node-7:28346] mca: base: close: unloading component openib
[sm-node-7:28342] select: init of component openib returned failure
[sm-node-7:28342] mca: base: close: component openib closed
[sm-node-7:28342] mca: base: close: unloading component openib
[sm-node-7:28345] mca: base: close: component openib closed
[sm-node-7:28345] mca: base: close: unloading component openib
[sm-node-7:28342] select: initializing btl component sm
[sm-node-7:28346] select: initializing btl component sm
[sm-node-7:28346] select: init of component sm returned success
[sm-node-7:28345] select: initializing btl component sm
[sm-node-7:28345] select: init of component sm returned success
[sm-node-8:13429] select: init of component openib returned failure
[sm-node-8:13429] mca: base: close: component openib closed
[sm-node-8:13429] mca: base: close: unloading component openib
[sm-node-8:13429] select: initializing btl component sm
[sm-node-8:13429] select: init of component sm returned success
[sm-node-7:28344] select: init of component openib returned failure
[sm-node-7:28344] mca: base: close: component openib closed
[sm-node-7:28344] mca: base: close: unloading component openib
[sm-node-7:28343] select: init of component openib returned failure
[sm-node-7:28344] select: initializing btl component sm
[sm-node-7:28344] select: init of component sm returned success
[sm-node-7:28343] mca: base: close: component openib closed
[sm-node-7:28343] mca: base: close: unloading component openib
[sm-node-7:28343] select: initializing btl component sm
[sm-node-7:28343] select: init of component sm returned success
[sm-node-7:28342] select: init of component sm returned success
[sm-node-8:13430] select: init of component openib returned failure [sm-node-8:13433] select: init of component openib returned failure [sm-node-8:13427] select: init of component openib returned failure [sm-node-8:13427] mca: base: close: component openib closed [sm-node-8:13427] mca: base: close: unloading component openib [sm-node-8:13430] mca: base: close: component openib closed [sm-node-8:13430] mca: base: close: unloading component openib [sm-node-8:13427] select: initializing btl component sm [sm-node-8:13433] mca: base: close: component openib closed [sm-node-8:13433] mca: base: close: unloading component openib [sm-node-8:13433] select: initializing btl component sm [sm-node-8:13430] select: initializing btl component sm [sm-node-8:13430] select: init of component sm returned success [sm-node-8:13433] select: init of component sm returned success [sm-node-8:13427] select: init of component sm returned success [sm-node-8:13428] select: init of component openib returned failure [sm-node-8:13428] mca: base: close: component openib closed [sm-node-8:13428] mca: base: close: unloading component openib [sm-node-8:13431] select: init of component openib returned failure [sm-node-8:13428] select: initializing btl component sm [sm-node-8:13428] select: init of component sm returned success [sm-node-8:13431] mca: base: close: component openib closed [sm-node-8:13431] mca: base: close: unloading component openib [sm-node-8:13431] select: initializing btl component sm [sm-node-8:13431] select: init of component sm returned success [sm-node-7:28348] select: init of component openib returned failure [sm-node-7:28348] mca: base: close: component openib closed [sm-node-7:28348] mca: base: close: unloading component openib [sm-node-7:28348] select: initializing btl component sm [sm-node-7:28348] select: init of component sm returned success [sm-node-8:13430] mca: bml: Using self btl for send to [[21320,1],9] on node sm-node-8 [sm-node-7:28346] mca: bml: Using self btl for 
send to [[21320,1],4] on node sm-node-7 [sm-node-8:13427] mca: bml: Using self btl for send to [[21320,1],6] on node sm-node-8 [sm-node-7:28343] mca: bml: Using self btl for send to [[21320,1],1] on node sm-node-7 [sm-node-8:13428] mca: bml: Using self btl for send to [[21320,1],7] on node sm-node-8 [sm-node-7:28344] mca: bml: Using self btl for send to [[21320,1],2] on node sm-node-7 [sm-node-8:13433] mca: bml: Using self btl for send to [[21320,1],11] on node sm-node-8 [sm-node-7:28342] mca: bml: Using self btl for send to [[21320,1],0] on node sm-node-7 [sm-node-8:13429] mca: bml: Using self btl for send to [[21320,1],8] on node sm-node-8 [sm-node-7:28348] mca: bml: Using self btl for send to [[21320,1],5] on node sm-node-7 [sm-node-8:13431] mca: bml: Using self btl for send to [[21320,1],10] on node sm-node-8 [sm-node-7:28345] mca: bml: Using self btl for send to [[21320,1],3] on node sm-node-7 [sm-node-8:13430] mca: bml: Using sm btl for send to [[21320,1],6] on node sm-node-8-ce [sm-node-8:13430] mca: bml: Using sm btl for send to [[21320,1],7] on node sm-node-8-ce [sm-node-8:13430] mca: bml: Using sm btl for send to [[21320,1],8] on node sm-node-8-ce [sm-node-8:13430] mca: bml: Using sm btl for send to [[21320,1],10] on node sm-node-8-ce [sm-node-8:13430] mca: bml: Using sm btl for send to [[21320,1],11] on node sm-node-8-ce [sm-node-7:28343] mca: bml: Using sm btl for send to [[21320,1],0] on node sm-node-7-ce [sm-node-7:28343] mca: bml: Using sm btl for send to [[21320,1],2] on node sm-node-7-ce [sm-node-7:28343] mca: bml: Using sm btl for send to [[21320,1],3] on node sm-node-7-ce [sm-node-7:28343] mca: bml: Using sm btl for send to [[21320,1],4] on node sm-node-7-ce [sm-node-7:28343] mca: bml: Using sm btl for send to [[21320,1],5] on node sm-node-7-ce [sm-node-8:13428] mca: bml: Using sm btl for send to [[21320,1],6] on node sm-node-8-ce [sm-node-8:13428] mca: bml: Using sm btl for send to [[21320,1],8] on node sm-node-8-ce [sm-node-8:13428] mca: bml: 
Using sm btl for send to [[21320,1],9] on node sm-node-8-ce [sm-node-8:13428] mca: bml: Using sm btl for send to [[21320,1],10] on node sm-node-8-ce [sm-node-8:13428] mca: bml: Using sm btl for send to [[21320,1],11] on node sm-node-8-ce [sm-node-8:13433] mca: bml: Using sm btl for send to [[21320,1],6] on node sm-node-8-ce [sm-node-8:13429] mca: bml: Using sm btl for send to [[21320,1],6] on node sm-node-8-ce [sm-node-8:13429] mca: bml: Using sm btl for send to [[21320,1],7] on node sm-node-8-ce [sm-node-8:13429] mca: bml: Using sm btl for send to [[21320,1],9] on node sm-node-8-ce [sm-node-8:13429] mca: bml: Using sm btl for send to [[21320,1],10] on node sm-node-8-ce [sm-node-8:13429] mca: bml: Using sm btl for send to [[21320,1],11] on node sm-node-8-ce [sm-node-7:28346] mca: bml: Using sm btl for send to [[21320,1],0] on node sm-node-7-ce [sm-node-7:28346] mca: bml: Using sm btl for send to [[21320,1],1] on node sm-node-7-ce [sm-node-7:28346] mca: bml: Using sm btl for send to [[21320,1],2] on node sm-node-7-ce [sm-node-7:28346] mca: bml: Using sm btl for send to [[21320,1],3] on node sm-node-7-ce [sm-node-8:13433] mca: bml: Using sm btl for send to [[21320,1],7] on node sm-node-8-ce [sm-node-8:13433] mca: bml: Using sm btl for send to [[21320,1],8] on node sm-node-8-ce [sm-node-8:13433] mca: bml: Using sm btl for send to [[21320,1],9] on node sm-node-8-ce [sm-node-8:13433] mca: bml: Using sm btl for send to [[21320,1],10] on node sm-node-8-ce [sm-node-7:28346] mca: bml: Using sm btl for send to [[21320,1],5] on node sm-node-7-ce [sm-node-8:13431] mca: bml: Using sm btl for send to [[21320,1],6] on node sm-node-8-ce [sm-node-8:13431] mca: bml: Using sm btl for send to [[21320,1],7] on node sm-node-8-ce [sm-node-8:13431] mca: bml: Using sm btl for send to [[21320,1],8] on node sm-node-8-ce [sm-node-8:13431] mca: bml: Using sm btl for send to [[21320,1],9] on node sm-node-8-ce [sm-node-8:13431] mca: bml: Using sm btl for send to [[21320,1],11] on node 
sm-node-8-ce [sm-node-7:28344] mca: bml: Using sm btl for send to [[21320,1],0] on node sm-node-7-ce [sm-node-7:28344] mca: bml: Using sm btl for send to [[21320,1],1] on node sm-node-7-ce [sm-node-7:28344] mca: bml: Using sm btl for send to [[21320,1],3] on node sm-node-7-ce [sm-node-7:28344] mca: bml: Using sm btl for send to [[21320,1],4] on node sm-node-7-ce [sm-node-7:28344] mca: bml: Using sm btl for send to [[21320,1],5] on node sm-node-7-ce [sm-node-8:13427] mca: bml: Using sm btl for send to [[21320,1],7] on node sm-node-8-ce [sm-node-8:13427] mca: bml: Using sm btl for send to [[21320,1],8] on node sm-node-8-ce [sm-node-8:13427] mca: bml: Using sm btl for send to [[21320,1],9] on node sm-node-8-ce [sm-node-7:28345] mca: bml: Using sm btl for send to [[21320,1],0] on node sm-node-7-ce [sm-node-7:28345] mca: bml: Using sm btl for send to [[21320,1],1] on node sm-node-7-ce [sm-node-8:13427] mca: bml: Using sm btl for send to [[21320,1],10] on node sm-node-8-ce [sm-node-8:13427] mca: bml: Using sm btl for send to [[21320,1],11] on node sm-node-8-ce [sm-node-7:28345] mca: bml: Using sm btl for send to [[21320,1],2] on node sm-node-7-ce [sm-node-7:28345] mca: bml: Using sm btl for send to [[21320,1],4] on node sm-node-7-ce [sm-node-7:28345] mca: bml: Using sm btl for send to [[21320,1],5] on node sm-node-7-ce [sm-node-7:28348] mca: bml: Using sm btl for send to [[21320,1],0] on node sm-node-7-ce [sm-node-7:28348] mca: bml: Using sm btl for send to [[21320,1],1] on node sm-node-7-ce [sm-node-7:28348] mca: bml: Using sm btl for send to [[21320,1],2] on node sm-node-7-ce [sm-node-7:28348] mca: bml: Using sm btl for send to [[21320,1],3] on node sm-node-7-ce [sm-node-7:28348] mca: bml: Using sm btl for send to [[21320,1],4] on node sm-node-7-ce [sm-node-7:28342] mca: bml: Using sm btl for send to [[21320,1],1] on node sm-node-7-ce [sm-node-7:28342] mca: bml: Using sm btl for send to [[21320,1],2] on node sm-node-7-ce [sm-node-7:28342] mca: bml: Using sm btl for 
send to [[21320,1],3] on node sm-node-7-ce
[sm-node-7:28342] mca: bml: Using sm btl for send to [[21320,1],4] on node sm-node-7-ce
[sm-node-7:28342] mca: bml: Using sm btl for send to [[21320,1],5] on node sm-node-7-ce
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[21320,1],0]) is on host: sm-node-7
  Process 2 ([[21320,1],8]) is on host: sm-node-8-ce
  BTLs attempted: self sm

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[sm-node-7:28344] *** An error occurred in MPI_Bcast
[sm-node-7:28344] *** reported by process [140129000226817,2]
[sm-node-7:28344] *** on communicator MPI_COMM_WORLD
[sm-node-7:28344] *** MPI_ERR_INTERN: internal error
[sm-node-7:28344] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[sm-node-7:28344] *** and potentially your MPI job)
[sm-node-6:12085] 11 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[sm-node-6:12085] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[sm-node-6:12085] 3 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
[sm-node-6:12085] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[root@sm-node-6 ~]# exit
exit
Script done on Mon 23 Jan 2017 10:15:25 AM EST
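Because the verbose output interleaves entries from twelve ranks, it can help to reduce it to one summary line per host and BTL component. The following is a minimal sketch, assuming the log has been saved to a file (the name btl-verbose.log is hypothetical) and that the entries have the "[host:pid] select: init of component X returned Y" shape seen above. The function name summarize_btl_init is made up for this example; grep -o is a common GNU/BSD extension rather than strict POSIX.

```shell
#!/bin/sh
# summarize_btl_init: read Open MPI btl_base_verbose output on stdin and
# print "host component result count" for each BTL init result.
summarize_btl_init() {
    # Pull out every init-result entry, even if several share one line.
    grep -o '\[[^]:]*:[0-9]*] select: init of component [a-z]* returned [a-z]*' |
    # Keep only host, component name, and result (success/failure).
    sed 's/^\[\([^:]*\):[0-9]*] select: init of component \([a-z]*\) returned \([a-z]*\)$/\1 \2 \3/' |
    # Count identical (host, component, result) triples.
    sort | uniq -c |
    # Move the count to the end and drop uniq's padding.
    awk '{print $2, $3, $4, $1}'
}

# Usage (hypothetical file name):
#   summarize_btl_init < btl-verbose.log
```

Applied to the log in this message, a summary of this kind would be expected to show openib failing on both sm-node-7 and sm-node-8 while self and sm succeed, which is consistent with the "BTLs attempted: self sm" line in the abort message.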