Jeremy, I implemented the APM support for the openib BTL a long time ago. I do not remember all the details of the implementation, but I remember that it supports both LMC bits and multiple IB ports. Unfortunately I'm super busy this week. I will try to look at it early next week.
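
Off the top of my head, the APM code in the openib BTL is controlled by a pair of MCA parameters, one for the multi-port case and one for the LMC case. I don't remember the exact names, so please verify them against your build with ompi_info; a rough sketch of what I would try (the over_lmc parameter name is from memory, double-check it):

    # List the openib BTL parameters related to APM; names may differ
    # slightly between Open MPI releases, so check against your install.
    ompi_info --param btl openib | grep -i apm

    # Enable APM across the two HCA ports (the option you already use),
    # and/or across LMC-derived LIDs if your subnet manager assigns LMC > 0.
    mpirun -np 2 -machinefile machines \
        -mca btl_openib_enable_apm_over_ports 1 \
        -mca btl_openib_enable_apm_over_lmc 1 \
        demo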
Pavel (Pasha) Shamis
---
Application Performance Tools Group
Computer Science and Math Division
Oak Ridge National Laboratory

On Feb 22, 2012, at 1:44 PM, Jeremy wrote:

> Hi,
>
> I am having a problem getting Alternative Path Migration (APM) to work
> over the InfiniBand ports on my HCA.
>
> Details on my configuration and the issue are below. Please let me know
> if you can provide any suggestions or corrections to my configuration.
> I will be happy to try other experiments and tests or provide additional
> details to debug this problem further.
>
> I have reviewed the Open MPI FAQ and the archive of this mailing list,
> but I was unable to resolve my problem. There was one thread on
> multi-rail fail-over with IB, but it did not provide sufficient
> information.
>
> Thanks for your help,
> Jeremy
>
> Configuration:
> Open MPI version 1.4.3, bundled with OFED.
> I have also tested with Open MPI version 1.5.4, but the results were the same.
>
> I have 2 machines; each machine has a dual-port Mellanox IB HCA,
> a Mellanox MCX354A-FCBT (ConnectX-3 FDR).
> I have cabled both ports of each HCA to the same IB switch (Mellanox SX6036).
>
> What I expected to happen:
> I am trying to migrate data transmission between the 2 ports of the same HCA.
> Start an MPI application. Unplug the fiber cable from Port 1 of an HCA.
> Observe that the MPI application continues and data is sent across Port 2
> of the HCA.
>
> However, when I unplug the cable from Port 1 of the IB HCA, the MPI
> application hangs and I get the following error messages:
> Error 10: IBV_EVENT_PORT_ERR
> Error 7: IBV_EVENT_PATH_MIG_ERR
> Alternative path migration event reported
> Trying to find additional path...
> APM: already all ports were used port_num 2 apm_port 2
>
> I've pasted the full verbose error message at the bottom of this email.
>
> I started the MPI application using the following mpirun invocation:
> mpirun -np 2 -machinefile machines -mca btl_openib_enable_apm_over_ports 1 demo
>
> What works:
> I think that the low-level Mellanox IB hardware is working as expected.
> The switch, cables, and both HCA ports move data OK.
> If I don't use the btl_openib_enable_apm_over_ports option, then MPI
> traffic is evenly spread across both Port 1 and Port 2 while the
> application is running.
> Also, I am able to fail over successfully using a bonded device with IP.
> For example, if I use netperf to send TCP data over a bonded IPoIB device,
> I get the expected behavior: when I unplug Port 1, netperf keeps running
> and traffic goes over Port 2.
>
> Detailed Error Message:
> --------------------------------------------------------------------------
> The OpenFabrics stack has reported a network error event. Open MPI
> will try to continue, but your job may end up failing.
>
>   Local host:       bones
>   MPI process PID:  23115
>   Error number:     10 (IBV_EVENT_PORT_ERR)
>
> This error may indicate connectivity problems within the fabric;
> please contact your system administrator.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> The OpenFabrics stack has reported a network error event. Open MPI
> will try to continue, but your job may end up failing.
>
>   Local host:       bones
>   MPI process PID:  23115
>   Error number:     7 (IBV_EVENT_PATH_MIG_ERR)
>
> This error may indicate connectivity problems within the fabric;
> please contact your system administrator.
> --------------------------------------------------------------------------
> [bones][[57528,1],0][btl_openib_async.c:327:btl_openib_async_deviceh]
> Alternative path migration event reported
> [bones][[57528,1],0][btl_openib_async.c:329:btl_openib_async_deviceh]
> Trying to find additional path...
> [bones][[57528,1],0][btl_openib_async.c:516:apm_update_port] APM:
> already all ports were used port_num 2 apm_port 2
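
One quick thing that may be worth checking while I find time to look at the code: for fail-over across ports, both ports need to be Active with LIDs assigned by the subnet manager before the job starts, otherwise there is no valid alternate path to load. The standard OFED tools will show that; a rough sketch, assuming the ConnectX-3 shows up as mlx4_0 (adjust to whatever ibv_devinfo lists on your nodes):

    # Port state, base LID and LMC for both ports of the HCA.
    ibstat mlx4_0 1
    ibstat mlx4_0 2

    # Or dump the full port attributes (state, lid, lmc, link layer, ...).
    ibv_devinfo -v -d mlx4_0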