Hello Brendan,

Sorry for the delay in responding.  I've been traveling for the past two weeks.

I traced through the debug output you sent.  It provided enough information
to show that, for some reason, when using the breakout cable, Open MPI
is unable to complete the initialization it needs to use the openib BTL.  It
correctly detects that the first port is not available, but it still fails
to initialize port 1.

To debug this further, I'd need to provide you with a custom Open MPI
build that has more debug output in the suspect area.

If you'd like to go this route, let me know and I'll build a one-off library
to try to debug this problem.

One thing to do, just as a sanity check, is to try the TCP BTL with the
breakout cable:

mpirun --mca btl tcp,self,sm ....

If that doesn't work, then I think there may be some network setup problem
that needs to be resolved before trying custom Open MPI tarballs.  A fuller
version of the sanity check is sketched below.
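For example, something like this, reusing the hostfile and binary from your
original command (names taken from your earlier mail; adjust as needed):

  mpirun --mca btl tcp,self,sm --mca btl_base_verbose 100 \
      -hostfile mpi-hosts-ce /usr/local/bin/IMB-MPI1

This takes the openib BTL out of the picture entirely, so if plain TCP also
fails over the breakout cable, the problem is in the network setup rather
than in Open MPI.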

Thanks,

Howard




2017-02-01 15:08 GMT-07:00 Brendan Myers <brendan.my...@soft-forge.com>:

> Hello Howard,
>
> I was wondering if you have been able to look at this issue at all, or if
> anyone has any ideas on what to try next.
>
>
>
> Thank you,
>
> Brendan
>
>
>
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Brendan Myers
> Sent: Tuesday, January 24, 2017 11:11 AM
> To: 'Open MPI Users' <users@lists.open-mpi.org>
> Subject: Re: [OMPI users] Open MPI over RoCE using breakout cable and switch
>
>
>
> Hello Howard,
>
> Here is the error output after building with debug enabled.  These Mellanox
> CX4 cards view each port as a separate device, and I am using port 1 on
> the card, which is device mlx5_0.
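>
> For reference, the device/port mapping can be confirmed with something
> like:
>
>   ibv_devinfo | grep -E 'hca_id|port:|state'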
>
>
>
> Thank you,
>
> Brendan
>
>
>
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Howard Pritchard
> Sent: Tuesday, January 24, 2017 8:21 AM
> To: Open MPI Users <users@lists.open-mpi.org>
> Subject: Re: [OMPI users] Open MPI over RoCE using breakout cable and switch
>
>
>
> Hello Brendan,
>
>
>
> This helps some, but it looks like we need more debug output.
>
>
>
> Could you build a debug version of Open MPI by adding --enable-debug
> to the configure options, and rerun the test with the breakout cable setup,
> keeping the --mca btl_base_verbose 100 command line option?
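>
> A minimal sketch of such a build, assuming the 2.0.1 tarball and an
> illustrative install prefix (adjust paths to your setup):
>
>   tar xf openmpi-2.0.1.tar.bz2 && cd openmpi-2.0.1
>   ./configure --prefix=$HOME/openmpi-2.0.1-debug --enable-debug
>   make -j 8 && make install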
>
>
>
> Thanks,
>
>
>
> Howard
>
>
>
>
>
> 2017-01-23 8:23 GMT-07:00 Brendan Myers <brendan.my...@soft-forge.com>:
>
> Hello Howard,
>
> Thank you for looking into this. Attached is the output you requested.
> Also, I am using Open MPI 2.0.1.
>
>
>
> Thank you,
>
> Brendan
>
>
>
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Howard Pritchard
> Sent: Friday, January 20, 2017 6:35 PM
> To: Open MPI Users <users@lists.open-mpi.org>
> Subject: Re: [OMPI users] Open MPI over RoCE using breakout cable and switch
>
>
>
> Hi Brendan
>
>
>
> I doubt this kind of config has gotten any testing with OMPI.  Could you
> rerun with
>
>
>
> --mca btl_base_verbose 100
>
>
>
> added to the command line and post the output to the list?
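>
> For concreteness, a sketch using the command from your mail below (same
> hostfile and binary), with the verbose option added:
>
>   mpirun --mca btl openib,self,sm --mca btl_base_verbose 100 \
>       --mca btl_openib_receive_queues P,65536,120,64,32 \
>       --mca btl_openib_cpc_include rdmacm \
>       -hostfile mpi-hosts-ce /usr/local/bin/IMB-MPI1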
>
>
>
> Howard
>
>
>
>
>
> Brendan Myers <brendan.my...@soft-forge.com> wrote on Fri., Jan. 20, 2017
> at 15:04:
>
> Hello,
>
> I am attempting to get Open MPI to run over 2 nodes using a switch and a
> single breakout cable with this design:
>
> (100GbE)QSFP <-> 2x (50GbE)QSFP
>
>
>
> Hardware Layout:
>
> Breakout cable module A connects to switch (100GbE)
>
> Breakout cable module B1 connects to node 1 RoCE NIC (50GbE)
>
> Breakout cable module B2 connects to node 2 RoCE NIC (50GbE)
>
> Switch is Mellanox SN 2700 100GbE RoCE switch
>
>
>
> ·         I  am able to pass RDMA traffic between the nodes with perftest
> (ib_write_bw) when using the breakout cable as the IC from both nodes to
> the switch.
>
> ·         When attempting to run a job using the breakout cable as the IC
> Open MPI aborts with failure to initialize open fabrics device errors.
>
> ·         If I replace the breakout cable with 2 standard QSFP cables the
> Open MPI job will complete correctly.
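>
> The perftest run was along these lines (device and port from this setup;
> exact flags may differ):
>
>   # on node 1 (server)
>   ib_write_bw -d mlx5_0 -i 1 -R
>   # on node 2 (client)
>   ib_write_bw -d mlx5_0 -i 1 -R <node1-ip>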
>
>
>
>
>
> This is the command I use; it works unless I attempt a run with the
> breakout cable as the IC:
>
>   mpirun --mca btl openib,self,sm \
>       --mca btl_openib_receive_queues P,65536,120,64,32 \
>       --mca btl_openib_cpc_include rdmacm \
>       -hostfile mpi-hosts-ce /usr/local/bin/IMB-MPI1
>
>
>
> If anyone has any idea as to why using a breakout cable is causing my jobs
> to fail, please let me know.
>
>
>
> Thank you,
>
>
>
> Brendan T. W. Myers
>
> brendan.my...@soft-forge.com
>
> Software Forge Inc
>
>
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
