Hi, I am having a problem getting Alternative Path Migration (APM) to work over the InfiniBand ports on my HCA.
Details on my configuration and the issue are below. Please let me know if you have any suggestions or corrections to my configuration. I will be happy to try other experiments and tests, or to provide additional details, to debug this problem further.

I have reviewed the Open MPI FAQ and the archive of this mailing list, but I was unable to resolve my problem. There was one thread on multi-rail fail-over with IB, but it did not provide sufficient information.

Thanks for your help,
Jeremy

Configuration:

Open MPI version 1.4.3, bundled with OFED. I have also tested with version 1.5.4, but the results were the same.

I have 2 machines. Each machine has a dual-port Mellanox IB HCA, model MCX354A-FCBT (ConnectX-3 FDR). I have cabled both ports of each HCA to the same IB switch (Mellanox SX6036).

What I expected to happen:

I am trying to migrate data transmission between the 2 ports of the same HCA:

1. Start an MPI application.
2. Unplug the fiber cable from Port 1 of an HCA.
3. Observe that the MPI application continues and data is sent across Port 2 of the HCA.

However, when I unplug the cable from Port 1 of the IB HCA, the MPI application hangs and I get the following error messages:

    Error 10: IBV_EVENT_PORT_ERR
    Error 7: IBV_EVENT_PATH_MIG_ERR
    Alternative path migration event reported
    Trying to find additional path...
    APM: already all ports were used port_num 2 apm_port 2

I've pasted the full verbose error message at the bottom of this email.

I started the MPI application using the following mpirun invocation:

    mpirun -np 2 -machinefile machines -mca btl_openib_enable_apm_over_ports 1 demo

What works:

I think that the low-level Mellanox IB hardware is working as expected. The switch, cables and both HCA ports move data OK. If I don't use the btl_openib_enable_apm_over_ports option, then MPI traffic is evenly spread across both Port 1 and Port 2 while the application is running. Also, I am able to successfully do fail-over using a bonded device with IP.
For example, if I use netperf to send TCP data over a bonded IPoIB device I get the expected behavior. When I unplug Port 1, netperf keeps running and traffic goes over Port 2.

Detailed Error Message:

--------------------------------------------------------------------------
The OpenFabrics stack has reported a network error event. Open MPI
will try to continue, but your job may end up failing.

  Local host:        bones
  MPI process PID:   23115
  Error number:      10 (IBV_EVENT_PORT_ERR)

This error may indicate connectivity problems within the fabric;
please contact your system administrator.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The OpenFabrics stack has reported a network error event. Open MPI
will try to continue, but your job may end up failing.

  Local host:        bones
  MPI process PID:   23115
  Error number:      7 (IBV_EVENT_PATH_MIG_ERR)

This error may indicate connectivity problems within the fabric;
please contact your system administrator.
--------------------------------------------------------------------------
[bones][[57528,1],0][btl_openib_async.c:327:btl_openib_async_deviceh] Alternative path migration event reported
[bones][[57528,1],0][btl_openib_async.c:329:btl_openib_async_deviceh] Trying to find additional path...
[bones][[57528,1],0][btl_openib_async.c:516:apm_update_port] APM: already all ports were used port_num 2 apm_port 2
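
In case it helps when comparing runs, here is a minimal sketch in Python that pulls the event numbers and the port_num / apm_port values out of output like the above. The names summarize, EVENT_RE and APM_RE are my own and not part of Open MPI, and the patterns are written only against the exact message wording quoted in this email, so they may need adjusting for other versions.

```python
import re

# Assumption: patterns match only the message wording quoted in this email;
# other Open MPI versions may format these lines differently.
EVENT_RE = re.compile(r"Error number:\s*(\d+)\s*\((IBV_EVENT_\w+)\)")
APM_RE = re.compile(r"port_num\s+(\d+)\s+apm_port\s+(\d+)")


def summarize(log_text):
    """Return (events, apm_ports) found in the given log text.

    events    -- list of (number, name) tuples, in order of appearance
    apm_ports -- list of (port_num, apm_port) tuples from the APM lines
    """
    events = [(int(n), name) for n, name in EVENT_RE.findall(log_text)]
    apm_ports = [(int(p), int(a)) for p, a in APM_RE.findall(log_text)]
    return events, apm_ports


if __name__ == "__main__":
    # Sample lines copied from the error output in this email.
    sample = (
        "Error number: 10 (IBV_EVENT_PORT_ERR)\n"
        "Error number: 7 (IBV_EVENT_PATH_MIG_ERR)\n"
        "[bones][[57528,1],0][btl_openib_async.c:516:apm_update_port] "
        "APM: already all ports were used port_num 2 apm_port 2\n"
    )
    events, apm_ports = summarize(sample)
    for number, name in events:
        print(f"event {number}: {name}")
    for port_num, apm_port in apm_ports:
        print(f"apm_update_port: port_num={port_num} apm_port={apm_port}")
```

Saving the mpirun output to a file and passing its contents to summarize() would give one short summary per run, which makes it easier to see whether the reported ports change between experiments.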