Hello,

We are running a cluster that has a good number of older nodes with
Mellanox IB HCAs that have the "mthca" device name ("ib_mthca" kernel
module).

These adapters are all at firmware level 4.8.917.

The Open MPI in use is 1.5.3, kernel 2.6.39, x86-64.  Jobs are
launched/managed using Slurm 2.2.7.  The IB software and drivers
correspond to OFED 1.5.3.2, and I've verified that the kernel modules
in use are all from this OFED version.

On nodes with the mthca hardware *only*, we get frequent but
intermittent job startup failures, with messages like:

/////////////////////////////////

[[19373,1],54][btl_openib_component.c:3320:handle_wc] from compute-c3-07
to: compute-c3-01 error polling LP CQ with status RECEIVER NOT READY
RETRY EXCEEDED ERROR status
number 13 for wr_id 2a25c200 opcode 128 vendor error 135 qp_idx 0

--------------------------------------------------------------------------
The OpenFabrics "receiver not ready" retry count on a per-peer
connection between two MPI processes has been exceeded.  In general,
this should not happen because Open MPI uses flow control on per-peer
connections to ensure that receivers are always ready when data is
sent.

[further standard error text snipped...]

Below is some information about the host that raised the error and the
peer to which it was connected:

  Local host:   compute-c3-07
  Local device: mthca0
  Peer host:    compute-c3-01

You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------

/////////////////////////////////
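
To see whether the failures cluster on particular host pairs, the
handle_wc lines can be tallied from the collected job output.  The
following is only a minimal sketch: the regular expression is derived
from the single sample message above, and the function name
tally_failures is made up for illustration, not part of Open MPI.

```python
import re
from collections import Counter

# Pattern derived only from the sample handle_wc message shown above;
# other Open MPI versions may word this line differently.
WC_ERROR = re.compile(
    r"from (?P<local>\S+) to: (?P<peer>\S+) error polling \w+ CQ "
    r"with status (?P<status>.+?) status number (?P<num>\d+)"
)


def tally_failures(log_text):
    """Count (local host, peer host, status number) triples in job output.

    Whitespace is collapsed first so that messages wrapped across
    several lines (as in the mail above) still match.
    """
    flat = " ".join(log_text.split())
    pairs = Counter()
    for match in WC_ERROR.finditer(flat):
        key = (match.group("local"), match.group("peer"), int(match.group("num")))
        pairs[key] += 1
    return pairs
```

Feeding it the concatenated stderr of several failed jobs and printing
pairs.most_common() would show whether a few nodes account for most of
the errors.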

During these job runs, I monitored the InfiniBand performance counters
on the endpoints and the switch.  None of the counters that would
indicate a problem changed on any of these ports during the failed job
startups.
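
A minimal sketch of such a before/after comparison on an endpoint
follows.  It assumes the port counters are exposed as one integer per
file under a sysfs directory such as
/sys/class/infiniband/mthca0/ports/1/counters (the port number 1 is an
assumption), and the function names are made up for illustration.  It
covers only the local HCA port, not the switch side.

```python
import os


def read_counters(counter_dir):
    """Return {counter_name: int_value} for each readable file in counter_dir."""
    values = {}
    for name in sorted(os.listdir(counter_dir)):
        path = os.path.join(counter_dir, name)
        try:
            with open(path) as fh:
                values[name] = int(fh.read().strip())
        except (OSError, ValueError):
            # Skip subdirectories, unreadable files and non-numeric entries.
            continue
    return values


def changed_counters(before, after):
    """Return {counter_name: delta} for counters whose value differs."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }
```

Taking one snapshot with read_counters() before launching a job and
one after it fails, then passing both to changed_counters(), gives the
set of counters that moved; an empty result matches what I observed.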

ibdiagnet works fine and properly enumerates the fabric and related
performance counters, both from the affected nodes and from other
nodes attached to the IB switch.  The IB connectivity itself seems fine
from these nodes.

Other nodes with different HCAs use the same InfiniBand fabric
continuously without any issue, so I don't think it's the fabric/switch.

I'm at a loss as to what to try next to find the root cause of the
issue.  I suspect it has something to do with the mthca
support/drivers, but how can I track this down further?

Thank you,

V. Ram.
