Hello,

We are running a cluster that has a good number of older nodes with Mellanox IB HCAs that have the "mthca" device name ("ib_mthca" kernel module).

These adapters are all at firmware level 4.8.917. The Open MPI in use is 1.5.3, the kernel is 2.6.39, x86-64. Jobs are launched/managed using Slurm 2.2.7. The IB software and drivers correspond to OFED 1.5.3.2, and I've verified that the kernel modules in use are all from this OFED version.

On nodes with the mthca hardware *only*, we get frequent but intermittent job startup failures, with messages like:

/////////////////////////////////

[[19373,1],54][btl_openib_component.c:3320:handle_wc] from compute-c3-07 to: compute-c3-01 error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status number 13 for wr_id 2a25c200 opcode 128 vendor error 135 qp_idx 0

--------------------------------------------------------------------------
The OpenFabrics "receiver not ready" retry count on a per-peer connection between two MPI processes has been exceeded. In general, this should not happen because Open MPI uses flow control on per-peer connections to ensure that receivers are always ready when data is sent.

[further standard error text snipped...]

Below is some information about the host that raised the error and the peer to which it was connected:

  Local host:   compute-c3-07
  Local device: mthca0
  Peer host:    compute-c3-01

You may need to consult with your system administrator to get this problem fixed.
--------------------------------------------------------------------------

/////////////////////////////////

During these job runs, I have monitored the InfiniBand performance counters on the endpoints and switch. No telltale counters for any of these ports change during these failed job initiations.
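In case it's useful, this is roughly how I sample the endpoint error counters around a launch window (a minimal sketch only, assuming the usual /sys/class/infiniband/<hca>/ports/<port>/counters/ sysfs layout; the counter list and the 60-second window are just what I happened to use):

/////////////////////////////////

#!/usr/bin/env python
# Sample the per-port IB error counters before and after a job launch and
# print any counters that moved. Assumes the standard Linux sysfs layout:
#   /sys/class/infiniband/<hca>/ports/<port>/counters/<counter>
import glob
import os
import time

# Error counters worth watching; names follow the usual sysfs counter files.
WATCHED = [
    "symbol_error",
    "link_error_recovery",
    "link_downed",
    "port_rcv_errors",
    "port_rcv_remote_physical_errors",
    "port_xmit_discards",
    "excessive_buffer_overrun_errors",
    "local_link_integrity_errors",
]

def read_counters():
    """Return {(hca, port, counter): value} for every local HCA port."""
    values = {}
    for port_dir in glob.glob("/sys/class/infiniband/*/ports/*"):
        hca = port_dir.split("/")[4]
        port = os.path.basename(port_dir)
        for name in WATCHED:
            path = os.path.join(port_dir, "counters", name)
            try:
                with open(path) as f:
                    values[(hca, port, name)] = int(f.read())
            except IOError:
                pass  # counter not exposed by this HCA/driver
    return values

if __name__ == "__main__":
    before = read_counters()
    time.sleep(60)  # window covering one job launch attempt
    after = read_counters()
    for key in sorted(after):
        delta = after[key] - before.get(key, 0)
        if delta:
            print("%s port %s: %s +%d" % (key[0], key[1], key[2], delta))

/////////////////////////////////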
ibdiagnet works fine and properly enumerates the fabric and the related performance counters, both from the affected nodes and from other nodes attached to the IB switch. The IB connectivity itself seems fine from these nodes. Other nodes with different HCAs use the same InfiniBand fabric continuously without any issue, so I don't think it's the fabric/switch.

I'm at a loss for what to do next to try to find the root cause of this issue. I suspect something related to the mthca support/drivers, but how can I track this down further?

Thank you,

V. Ram.

-- 
http://www.fastmail.fm - Or how I learned to stop worrying and love email again