Open MPI InfiniBand gurus and/or Mellanox folks: could I please get some assistance with this? Any suggestions on tunables or debugging parameters to try?
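For what it's worth, this is the sort of thing I have been poking at so far, in case it helps frame the question (a sketch only; ./my_app stands in for our actual binary, and I have not confirmed which, if any, of the RNR-related parameters are the right ones to touch):

  # list the tunables the openib BTL exposes in this build, in particular
  # anything related to the RNR retry/timer mentioned in the error
  ompi_info --param btl openib | grep -i rnr

  # re-run a failing case with verbose BTL output to watch queue-pair setup
  mpirun --mca btl_base_verbose 100 ./my_app

  # experimentally raise the RNR retry count, assuming that parameter shows
  # up in the ompi_info output above
  mpirun --mca btl_openib_ib_rnr_retry 7 ./my_app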
Thank you very much.

On Mon, Dec 12, 2011, at 10:42 AM, V. Ram wrote:
> Hello,
>
> We are running a cluster that has a good number of older nodes with
> Mellanox IB HCAs that have the "mthca" device name ("ib_mthca" kernel
> module).
>
> These adapters are all at firmware level 4.8.917.
>
> The Open MPI in use is 1.5.3, kernel 2.6.39, x86-64. Jobs are
> launched/managed using Slurm 2.2.7. The IB software and drivers
> correspond to OFED 1.5.3.2, and I've verified that the kernel modules
> in use are all from this OFED version.
>
> On nodes with the mthca hardware *only*, we get frequent but
> intermittent job startup failures, with messages like:
>
> /////////////////////////////////
>
> [[19373,1],54][btl_openib_component.c:3320:handle_wc] from compute-c3-07
> to: compute-c3-01 error polling LP CQ with status RECEIVER NOT READY
> RETRY EXCEEDED ERROR status number 13 for wr_id 2a25c200 opcode 128
> vendor error 135 qp_idx 0
>
> --------------------------------------------------------------------------
> The OpenFabrics "receiver not ready" retry count on a per-peer
> connection between two MPI processes has been exceeded. In general,
> this should not happen because Open MPI uses flow control on per-peer
> connections to ensure that receivers are always ready when data is
> sent.
>
> [further standard error text snipped...]
>
> Below is some information about the host that raised the error and the
> peer to which it was connected:
>
>   Local host:    compute-c3-07
>   Local device:  mthca0
>   Peer host:     compute-c3-01
>
> You may need to consult with your system administrator to get this
> problem fixed.
> --------------------------------------------------------------------------
>
> /////////////////////////////////
>
> During these job runs, I have monitored the InfiniBand performance
> counters on the endpoints and switch. No telltale counters for any of
> these ports change during these failed job initiations.
>
> ibdiagnet works fine and properly enumerates the fabric and the related
> performance counters, both from the affected nodes and from other nodes
> attached to the IB switch. The IB connectivity itself seems fine from
> these nodes.
>
> Other nodes with different HCAs use the same InfiniBand fabric
> continuously without any issue, so I don't think it's the fabric/switch.
>
> I'm at a loss for what to do next to try to find the root cause of this
> issue. I suspect it may have something to do with the mthca
> support/drivers, but how can I track this down further?
>
> Thank you,
>
> V. Ram.

-- 
http://www.fastmail.fm - Choose from over 50 domains or use your own
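P.S. For completeness, this is roughly how I have been checking the adapters, counters, and driver stack on the affected nodes (<LID> is a placeholder for the port LID of the node under test):

  # adapter, firmware, and port state as seen by the verbs layer
  ibv_devinfo -d mthca0
  ibstat mthca0

  # port counters for an endpoint, sampled before and after a failed
  # startup (nothing telltale moves during the failures)
  perfquery <LID> 1

  # confirm that the loaded ib_mthca module really comes from the OFED install
  modinfo ib_mthca | grep -i -e version -e filename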