Jeremy,

As far as I understand, the tool that Yevgeny recommended showed that the remote port is reachable. Based on the log that has been provided I can't find the issue in ompi; everything seems to be kosher. Unfortunately, I do not have a platform where I can try to reproduce the issue. I would ask Yevgeny; maybe Mellanox will be able to reproduce and debug the issue.
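For context, the verbs-level mechanics behind APM are roughly this: an alternate path is loaded into each connected RC QP and migration is re-armed, and the driver/HCA later reports the outcome through async events. The sketch below only illustrates the standard libibverbs calls involved; the LID/port values are placeholders and this is not Open MPI's actual code.

#include <string.h>
#include <stdio.h>
#include <infiniband/verbs.h>

/* Illustrative only -- not Open MPI's actual code.  Arms Automatic Path
 * Migration on an already-connected RC QP: load an alternate path
 * (placeholder LID/port values) and re-arm migration. */
static int arm_apm(struct ibv_qp *qp, uint16_t alt_dlid, uint8_t alt_port)
{
    struct ibv_qp_attr attr;
    memset(&attr, 0, sizeof(attr));

    attr.alt_ah_attr.dlid     = alt_dlid;   /* e.g. peer's LID on the backup port */
    attr.alt_ah_attr.sl       = 0;
    attr.alt_ah_attr.port_num = alt_port;   /* local backup port, e.g. 2 */
    attr.alt_pkey_index       = 0;
    attr.alt_port_num         = alt_port;
    attr.alt_timeout          = 14;
    attr.path_mig_state       = IBV_MIG_REARM;

    return ibv_modify_qp(qp, &attr, IBV_QP_ALT_PATH | IBV_QP_PATH_MIG_STATE);
}

/* Illustrative only: the async events one expects when the primary path
 * fails and the HCA either migrates to the alternate path or gives up. */
static void report_apm_event(struct ibv_context *ctx)
{
    struct ibv_async_event ev;

    if (ibv_get_async_event(ctx, &ev))      /* blocks until an event arrives */
        return;
    switch (ev.event_type) {
    case IBV_EVENT_PORT_ERR:     printf("port error\n");                    break;
    case IBV_EVENT_PATH_MIG:     printf("QP migrated to alternate path\n"); break;
    case IBV_EVENT_PATH_MIG_ERR: printf("path migration failed\n");         break;
    default:                                                                break;
    }
    ibv_ack_async_event(&ev);
}

If the alternate path is loaded and armed correctly, a pulled cable should produce a port-error event followed by a path-migration event rather than a hang, so the events seen below are the first place to look.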
Pavel (Pasha) Shamis
---
Application Performance Tools Group
Computer Science and Math Division
Oak Ridge National Laboratory

On Mar 21, 2012, at 9:31 AM, Jeremy wrote:

> Hi Pasha,
>
> I just wanted to check if you had any further suggestions regarding
> the APM issue based on the updated info in my previous email.
>
> Thanks,
>
> -Jeremy
>
> On Mon, Mar 12, 2012 at 12:43 PM, Jeremy <spritzy...@gmail.com> wrote:
>> Hi Pasha, Yevgeny,
>>
>>>> My educated guess is that for some reason there is no direct connection path
>>>> between lid-2 and lid-4. To prove it we have to look at the OpenSM routing
>>>> information.
>>
>>> If you don't get a response, or you get info for a device different
>>> from what you would expect, then the two ports are not part of the
>>> same subnet, and APM is expected to fail.
>>> Otherwise - it's probably a bug.
>>
>> I've tried your suggestions and the details are below. I am now
>> testing with a trivial MPI application that just does an
>> MPI_Send/MPI_Recv and then sleeps for a while (attached). There is
>> much less output to weed through now!
>>
>> When I unplug a cable from Port 1, the LID associated with Port 2 is
>> still reachable with smpquery. So it looks like there should be a
>> valid path to migrate to on the same subnet.
>>
>> I am using 2 hosts in this output:
>> sulu: This is the host where I unplug the cable from Port 1. The
>> cable on Port 2 is connected all the time. LIDs 4 and 5.
>> bones: On this host I leave cables connected to both Ports all the
>> time. LIDs 2 and 3.
>>
>> A) Before I start, sulu shows that both Ports are up and active using
>> LIDs 4 and 5:
>>
>> sulu> ibstatus
>> Infiniband device 'mlx4_0' port 1 status:
>>     default gid:   fe80:0000:0000:0000:0002:c903:0033:6fe1
>>     base lid:      0x4
>>     sm lid:        0x6
>>     state:         4: ACTIVE
>>     phys state:    5: LinkUp
>>     rate:          56 Gb/sec (4X FDR)
>>     link_layer:    InfiniBand
>>
>> Infiniband device 'mlx4_0' port 2 status:
>>     default gid:   fe80:0000:0000:0000:0002:c903:0033:6fe2
>>     base lid:      0x5
>>     sm lid:        0x6
>>     state:         4: ACTIVE
>>     phys state:    5: LinkUp
>>     rate:          56 Gb/sec (4X FDR)
>>     link_layer:    InfiniBand
>>
>> B) The other host, bones, is able to get to LIDs 4 and 5 OK:
>>
>> bones> smpquery --Ca mlx4_0 --Port 1 NodeInfo 4
>> # Node info: Lid 4
>> BaseVers:........................1
>> ClassVers:.......................1
>> NodeType:........................Channel Adapter
>> NumPorts:........................2
>> SystemGuid:......................0x0002c90300336fe3
>> Guid:............................0x0002c90300336fe0
>> PortGuid:........................0x0002c90300336fe1
>> PartCap:.........................128
>> DevId:...........................0x1003
>> Revision:........................0x00000000
>> LocalPort:.......................1
>> VendorId:........................0x0002c9
>>
>> bones> smpquery --Ca mlx4_0 --Port 1 NodeInfo 5
>> # Node info: Lid 5
>> BaseVers:........................1
>> ClassVers:.......................1
>> NodeType:........................Channel Adapter
>> NumPorts:........................2
>> SystemGuid:......................0x0002c90300336fe3
>> Guid:............................0x0002c90300336fe0
>> PortGuid:........................0x0002c90300336fe2
>> PartCap:.........................128
>> DevId:...........................0x1003
>> Revision:........................0x00000000
>> LocalPort:.......................2
>> VendorId:........................0x0002c9
>>
>> C) I start the MPI program. See attached file for output.
>>
>> D) During Iteration 3, I unplugged the cable on Port 1 of sulu.
>> - I get the expected network error event message.
>> - sulu shows that Port 1 is down and Port 2 is active as expected.
>> - bones is still able to get to LID 5 on Port 2 of sulu as expected.
>> - The MPI application hangs and then terminates instead of running via LID 5.
>>
>> sulu> ibstatus
>> Infiniband device 'mlx4_0' port 1 status:
>>     default gid:   fe80:0000:0000:0000:0002:c903:0033:6fe1
>>     base lid:      0x4
>>     sm lid:        0x6
>>     state:         1: DOWN
>>     phys state:    2: Polling
>>     rate:          40 Gb/sec (4X QDR)
>>     link_layer:    InfiniBand
>>
>> Infiniband device 'mlx4_0' port 2 status:
>>     default gid:   fe80:0000:0000:0000:0002:c903:0033:6fe2
>>     base lid:      0x5
>>     sm lid:        0x6
>>     state:         4: ACTIVE
>>     phys state:    5: LinkUp
>>     rate:          56 Gb/sec (4X FDR)
>>     link_layer:    InfiniBand
>>
>> bones> smpquery --Ca mlx4_0 --Port 1 NodeInfo 4
>> ibwarn: [11192] mad_rpc: _do_madrpc failed; dport (Lid 4)
>> smpquery: iberror: failed: operation NodeInfo: node info query failed
>>
>> bones> smpquery --Ca mlx4_0 --Port 1 NodeInfo 5
>> # Node info: Lid 5
>> BaseVers:........................1
>> ClassVers:.......................1
>> NodeType:........................Channel Adapter
>> NumPorts:........................2
>> SystemGuid:......................0x0002c90300336fe3
>> Guid:............................0x0002c90300336fe0
>> PortGuid:........................0x0002c90300336fe2
>> PartCap:.........................128
>> DevId:...........................0x1003
>> Revision:........................0x00000000
>> LocalPort:.......................2
>> VendorId:........................0x0002c9
>>
>> Thanks,
>>
>> -Jeremy
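The test program referenced in step C above was sent as an attachment and is not reproduced in this thread. A minimal sketch of a test along the lines described (an MPI_Send/MPI_Recv exchange followed by a sleep, repeated for several iterations) might look like the following; it is illustrative only and is not the attached file.

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

/* Illustrative only -- not the attached program.  Two ranks exchange a
 * small message, sleep, and repeat, leaving a window in each iteration
 * to pull a cable and watch whether APM keeps the job alive. */
int main(int argc, char **argv)
{
    int rank, iter, buf = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (iter = 1; iter <= 10; iter++) {
        if (rank == 0) {
            buf = iter;
            MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("Iteration %d complete\n", iter);
            fflush(stdout);
        } else if (rank == 1) {
            MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
        sleep(30);  /* window in which to unplug/replug a cable */
    }

    MPI_Finalize();
    return 0;
}

Built with mpicc and launched across sulu and bones with whatever APM-related settings are already in use, each iteration leaves a long pause in which to unplug the Port 1 cable and observe whether traffic continues over LID 5.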