Hi Pasha, I just wanted to check if you had any further suggestions regarding the APM issue based on the updated info in my previous email.
Thanks,

-Jeremy

On Mon, Mar 12, 2012 at 12:43 PM, Jeremy <spritzy...@gmail.com> wrote:
> Hi Pasha, Yevgeny,
>
>>> My educated guess is that for some reason there is no direct
>>> connection path between lid-2 and lid-4. To prove it, we have to
>>> look at the OpenSM routing information.
>
>> If you don't get a response, or you get info for a device different
>> from what you would expect, then the two ports are not part of the
>> same subnet, and APM is expected to fail.
>> Otherwise - it's probably a bug.
>
> I've tried your suggestions and the details are below. I am now
> testing with a trivial MPI application that just does an
> MPI_Send/MPI_Recv and then sleeps for a while (attached). There is
> much less output to weed through now!
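> For reference, the core of the test looks roughly like this (a
> minimal sketch matching the description above; the attached source
> is what I actually ran, so details may differ):
>
>   /* Two ranks exchange one message per iteration, then sleep long
>    * enough to pull a cable mid-run. Sketch only, not the attachment. */
>   #include <mpi.h>
>   #include <stdio.h>
>   #include <unistd.h>
>
>   int main(int argc, char **argv)
>   {
>       int rank, i, buf = 0;
>
>       MPI_Init(&argc, &argv);
>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>       for (i = 0; i < 10; i++) {
>           if (rank == 0)
>               MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
>           else if (rank == 1)
>               MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
>                        MPI_STATUS_IGNORE);
>           if (rank == 0)
>               printf("Iteration %d done\n", i);
>           sleep(30);  /* window to unplug/replug a cable */
>       }
>
>       MPI_Finalize();
>       return 0;
>   }
>
> (Run with: mpirun -np 2 ./apm_test, one rank on each host.)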
>
> When I unplug a cable from Port 1, the LID associated with Port 2 is
> still reachable with smpquery. So it looks like there should be a
> valid path to migrate to on the same subnet.
>
> I am using 2 hosts in this output:
> sulu: This is the host where I unplug the cable from Port 1. The
> cable on Port 2 is connected all the time. LIDs 4 and 5.
> bones: On this host I leave cables connected to both Ports all the
> time. LIDs 2 and 3.
>
> A) Before I start, sulu shows that both Ports are up and active using
> LIDs 4 and 5:
>
> sulu> ibstatus
> Infiniband device 'mlx4_0' port 1 status:
>         default gid:   fe80:0000:0000:0000:0002:c903:0033:6fe1
>         base lid:      0x4
>         sm lid:        0x6
>         state:         4: ACTIVE
>         phys state:    5: LinkUp
>         rate:          56 Gb/sec (4X FDR)
>         link_layer:    InfiniBand
>
> Infiniband device 'mlx4_0' port 2 status:
>         default gid:   fe80:0000:0000:0000:0002:c903:0033:6fe2
>         base lid:      0x5
>         sm lid:        0x6
>         state:         4: ACTIVE
>         phys state:    5: LinkUp
>         rate:          56 Gb/sec (4X FDR)
>         link_layer:    InfiniBand
>
> B) The other host, bones, is able to get to LIDs 4 and 5 OK:
>
> bones> smpquery --Ca mlx4_0 --Port 1 NodeInfo 4
> # Node info: Lid 4
> BaseVers:........................1
> ClassVers:.......................1
> NodeType:........................Channel Adapter
> NumPorts:........................2
> SystemGuid:......................0x0002c90300336fe3
> Guid:............................0x0002c90300336fe0
> PortGuid:........................0x0002c90300336fe1
> PartCap:.........................128
> DevId:...........................0x1003
> Revision:........................0x00000000
> LocalPort:.......................1
> VendorId:........................0x0002c9
>
> bones> smpquery --Ca mlx4_0 --Port 1 NodeInfo 5
> # Node info: Lid 5
> BaseVers:........................1
> ClassVers:.......................1
> NodeType:........................Channel Adapter
> NumPorts:........................2
> SystemGuid:......................0x0002c90300336fe3
> Guid:............................0x0002c90300336fe0
> PortGuid:........................0x0002c90300336fe2
> PartCap:.........................128
> DevId:...........................0x1003
> Revision:........................0x00000000
> LocalPort:.......................2
> VendorId:........................0x0002c9
>
> C) I start the MPI program. See attached file for output.
>
> D) During Iteration 3, I unplugged the cable on Port 1 of sulu.
> - I get the expected network error event message.
> - sulu shows that Port 1 is down and Port 2 is active as expected.
> - bones is still able to get to LID 5 on Port 2 of sulu as expected.
> - The MPI application hangs and then terminates instead of running
>   via LID 5.
>
> sulu> ibstatus
> Infiniband device 'mlx4_0' port 1 status:
>         default gid:   fe80:0000:0000:0000:0002:c903:0033:6fe1
>         base lid:      0x4
>         sm lid:        0x6
>         state:         1: DOWN
>         phys state:    2: Polling
>         rate:          40 Gb/sec (4X QDR)
>         link_layer:    InfiniBand
>
> Infiniband device 'mlx4_0' port 2 status:
>         default gid:   fe80:0000:0000:0000:0002:c903:0033:6fe2
>         base lid:      0x5
>         sm lid:        0x6
>         state:         4: ACTIVE
>         phys state:    5: LinkUp
>         rate:          56 Gb/sec (4X FDR)
>         link_layer:    InfiniBand
>
> bones> smpquery --Ca mlx4_0 --Port 1 NodeInfo 4
> ibwarn: [11192] mad_rpc: _do_madrpc failed; dport (Lid 4)
> smpquery: iberror: failed: operation NodeInfo: node info query failed
>
> bones> smpquery --Ca mlx4_0 --Port 1 NodeInfo 5
> # Node info: Lid 5
> BaseVers:........................1
> ClassVers:.......................1
> NodeType:........................Channel Adapter
> NumPorts:........................2
> SystemGuid:......................0x0002c90300336fe3
> Guid:............................0x0002c90300336fe0
> PortGuid:........................0x0002c90300336fe2
> PartCap:.........................128
> DevId:...........................0x1003
> Revision:........................0x00000000
> LocalPort:.......................2
> VendorId:........................0x0002c9
>
> Thanks,
>
> -Jeremy
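P.S. In case it helps narrow things down, a standalone watcher along
the lines of the sketch below should print the device-level async
events as they arrive (e.g. IBV_EVENT_PORT_ERR when the cable is
pulled, IBV_EVENT_PORT_ACTIVE when it comes back). Note that
QP-affiliated events such as IBV_EVENT_PATH_MIG are only delivered to
the process that owns the QP, so the migration itself would have to be
reported from inside the MPI job. This is just a sketch; I haven't run
it against this setup:

  /* Print libibverbs async events for the first HCA (mlx4_0 here).
   * Port events are delivered to any open context on the device;
   * QP events (e.g. PATH_MIG) only reach the QP's owning process. */
  #include <stdio.h>
  #include <infiniband/verbs.h>

  int main(void)
  {
      struct ibv_device **devs = ibv_get_device_list(NULL);
      if (!devs || !devs[0]) {
          fprintf(stderr, "no IB devices found\n");
          return 1;
      }

      struct ibv_context *ctx = ibv_open_device(devs[0]);
      if (!ctx) {
          fprintf(stderr, "ibv_open_device failed\n");
          return 1;
      }

      for (;;) {
          struct ibv_async_event ev;
          if (ibv_get_async_event(ctx, &ev))
              break;
          printf("async event: %s (port %d)\n",
                 ibv_event_type_str(ev.event_type),
                 ev.element.port_num);  /* meaningful for port events only */
          ibv_ack_async_event(&ev);
      }

      ibv_close_device(ctx);
      ibv_free_device_list(devs);
      return 0;
  }

(Builds with: gcc -o ibwatch ibwatch.c -libverbs)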