Open MPI InfiniBand gurus and/or Mellanox folks: could I please get some
assistance with this?  Any suggestions on tunables or debugging
parameters to try?
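
For reference, these are the openib BTL knobs I was planning to
experiment with unless someone advises otherwise.  The parameter names
below are my reading of what the 1.5 series provides (ompi_info --param
btl openib should confirm them); the specific values are guesses on my
part, not recommendations:

  # Turn up verbosity so the openib BTL reports what it is doing:
  mpirun --mca btl_base_verbose 100 --mca btl_openib_verbose 1 ...

  # Raise the IB retry/timeout values from whatever the defaults are:
  mpirun --mca btl_openib_ib_retry_count 7 \
         --mca btl_openib_ib_rnr_retry 7 \
         --mca btl_openib_ib_timeout 23 ...

  # Sanity check: take the openib BTL out of the picture entirely:
  mpirun --mca btl tcp,sm,self ...

(For srun-launched jobs the same settings can go into the environment
as OMPI_MCA_<param_name>=<value>.)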

Thank you very much.
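
P.S. In the meantime, I plan to collect the following from one of the
affected nodes, in case the output would help anyone diagnose this.
These are just the standard OFED / infiniband-diags tools; nothing
below is specific to our site, and I'm happy to gather different data
if someone suggests it:

  # HCA, firmware, and port details as libibverbs sees them:
  ibv_devinfo -v

  # Port state and counters for the mthca adapter:
  ibstat mthca0
  perfquery

  # Anything the mthca driver has logged since boot:
  dmesg | grep -i mthca

  # The options the ib_mthca module was actually loaded with:
  grep -r . /sys/module/ib_mthca/parameters/ 2>/dev/null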

On Mon, Dec 12, 2011, at 10:42 AM, V. Ram wrote:
> Hello,
> 
> We are running a cluster that has a good number of older nodes with
> Mellanox IB HCAs that have the "mthca" device name ("ib_mthca" kernel
> module).
> 
> These adapters are all at firmware level 4.8.917.
> 
> The Open MPI in use is 1.5.3, kernel 2.6.39, x86-64.  Jobs are
> launched/managed using Slurm 2.2.7.  The IB software and drivers
> correspond to OFED 1.5.3.2, and I've verified that the kernel modules
> in use are all from this OFED version.
> 
> On nodes with the mthca hardware *only*, we get frequent but
> intermittent job startup failures, with messages like:
> 
> /////////////////////////////////
> 
> [[19373,1],54][btl_openib_component.c:3320:handle_wc] from compute-c3-07
> to: compute-c3-01 error polling LP CQ with status RECEIVER NOT READY
> RETRY EXCEEDED ERROR status
> number 13 for wr_id 2a25c200 opcode 128 vendor error 135 qp_idx 0
> 
> --------------------------------------------------------------------------
> The OpenFabrics "receiver not ready" retry count on a per-peer
> connection between two MPI processes has been exceeded.  In general,
> this should not happen because Open MPI uses flow control on per-peer
> connections to ensure that receivers are always ready when data is
> sent.
> 
> [further standard error text snipped...]
> 
> Below is some information about the host that raised the error and the
> peer to which it was connected:
> 
>   Local host:   compute-c3-07
>   Local device: mthca0
>   Peer host:    compute-c3-01
> 
> You may need to consult with your system administrator to get this
> problem fixed.
> --------------------------------------------------------------------------
> 
> /////////////////////////////////
> 
> During these job runs, I have monitored the InfiniBand performance
> counters on the endpoints and the switch.  None of the counters that
> would flag a problem change on any of these ports during the failed
> job startups.
> 
> ibdiagnet runs cleanly and properly enumerates the fabric and the
> related performance counters, both from the affected nodes and from
> other nodes attached to the IB switch.  The IB connectivity itself
> seems fine from these nodes.
> 
> Other nodes with different HCAs use the same InfiniBand fabric
> continuously without any issue, so I don't think it's the fabric/switch.
> 
> I'm at a loss for what to do next to try to find the root cause of
> the issue.  I suspect something in the mthca support/drivers, but how
> can I track this down further?
> 
> Thank you,
> 
> V. Ram.

-- 
http://www.fastmail.fm - Choose from over 50 domains or use your own
