Very strange. I have a lot of older mthca-based HCAs in my Cisco MPI test cluster, and I don't see these kinds of problems.

Mellanox -- any ideas?
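In the meantime, it might be worth experimenting with the two QP
attributes that govern RNR behavior on a reliably-connected QP (Terry
explains the underlying receiver-not-ready mechanism below). If I'm
remembering the 1.5 series right, the openib BTL exposes them as the
MCA parameters btl_openib_ib_min_rnr_timer and btl_openib_ib_rnr_retry
(check "ompi_info --param btl openib" on your build). At the verbs
level they are set during the RTR and RTS transitions. Here is a
minimal sketch of a generic RC bring-up, just to show where the knobs
sit -- bring_up_rc_qp() is a hypothetical helper, not Open MPI code,
and the values are illustrative rather than Open MPI's defaults:

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Hypothetical helper: take an RC QP that is already in INIT to RTS.
 * Assumes remote_qpn/remote_lid/remote_psn were exchanged out of band. */
static int bring_up_rc_qp(struct ibv_qp *qp, uint32_t remote_qpn,
                          uint16_t remote_lid, uint32_t remote_psn,
                          uint32_t local_psn)
{
    struct ibv_qp_attr attr;

    /* INIT -> RTR.  min_rnr_timer is what the receiver advertises:
     * how long a sender should back off after getting an RNR NAK. */
    memset(&attr, 0, sizeof attr);
    attr.qp_state           = IBV_QPS_RTR;
    attr.path_mtu           = IBV_MTU_1024;
    attr.dest_qp_num        = remote_qpn;
    attr.rq_psn             = remote_psn;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer      = 12;   /* IBTA encoding; illustrative */
    attr.ah_attr.dlid       = remote_lid;
    attr.ah_attr.port_num   = 1;
    if (ibv_modify_qp(qp, &attr,
                      IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                      IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                      IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER))
        return -1;

    /* RTR -> RTS.  rnr_retry is how many RNR NAKs the sender tolerates
     * before completing the work request in error. */
    memset(&attr, 0, sizeof attr);
    attr.qp_state      = IBV_QPS_RTS;
    attr.timeout       = 14;        /* illustrative */
    attr.retry_cnt     = 7;         /* illustrative */
    attr.rnr_retry     = 7;         /* 7 = retry forever; illustrative */
    attr.sq_psn        = local_psn;
    attr.max_rd_atomic = 1;
    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                         IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                         IBV_QP_MAX_QP_RD_ATOMIC);
}

Note that rnr_retry = 7 is the IBTA "retry forever" encoding; any
smaller value means a sender whose peer is slow to post receive buffers
can eventually fail the send with exactly the RNR RETRY EXCEEDED
completion reported below.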
On Dec 15, 2011, at 7:24 PM, V. Ram wrote:

> Hi Terry,
>
> Thanks so much for the response. My replies are in-line below.
>
> On Thu, Dec 15, 2011, at 07:00 AM, TERRY DONTJE wrote:
>> IIRC, RNRs are usually due to the receiving side not having a segment
>> registered and ready to receive data on a QP. The btl does go through
>> a big dance and does its own flow control to make sure this doesn't
>> happen.
>>
>> So when this happens, are both the sending and receiving nodes using
>> mthcas to communicate?
>
> Yes. For the newer nodes using onboard mlx4, this issue doesn't arise.
> The mlx4-based nodes are using the same core switch as the mthca nodes.
>
>> By any chance is it a particular node (or pair of nodes) this seems
>> to happen with?
>
> No. I've got 40 nodes total with this hardware configuration, and the
> problem has been seen on most/all nodes at one time or another. Based
> on the limited number of observable parameters I'm aware of, it
> doesn't seem to depend on the number of nodes involved.
>
> It is an intermittent problem, but when it occurs, it occurs at job
> launch, and it happens on most launches.
>
> Thanks,
>
> V. Ram
>
>> --td
>>>
>>> Open MPI InfiniBand gurus and/or Mellanox: could I please get some
>>> assistance with this? Any suggestions on tunables or debugging
>>> parameters to try?
>>>
>>> Thank you very much.
>>>
>>> On Mon, Dec 12, 2011, at 10:42 AM, V. Ram wrote:
>>>> Hello,
>>>>
>>>> We are running a cluster that has a good number of older nodes with
>>>> Mellanox IB HCAs that have the "mthca" device name ("ib_mthca"
>>>> kernel module).
>>>>
>>>> These adapters are all at firmware level 4.8.917.
>>>>
>>>> The Open MPI in use is 1.5.3, kernel 2.6.39, x86-64. Jobs are
>>>> launched/managed using Slurm 2.2.7. The IB software and drivers
>>>> correspond to OFED 1.5.3.2, and I've verified that the kernel
>>>> modules in use are all from this OFED version.
>>>>
>>>> On nodes with the mthca hardware *only*, we get frequent but
>>>> intermittent job startup failures, with messages like:
>>>>
>>>> /////////////////////////////////
>>>>
>>>> [[19373,1],54][btl_openib_component.c:3320:handle_wc] from
>>>> compute-c3-07 to: compute-c3-01 error polling LP CQ with status
>>>> RECEIVER NOT READY RETRY EXCEEDED ERROR status number 13 for
>>>> wr_id 2a25c200 opcode 128 vendor error 135 qp_idx 0
>>>>
>>>> --------------------------------------------------------------------------
>>>> The OpenFabrics "receiver not ready" retry count on a per-peer
>>>> connection between two MPI processes has been exceeded. In general,
>>>> this should not happen because Open MPI uses flow control on
>>>> per-peer connections to ensure that receivers are always ready when
>>>> data is sent.
>>>>
>>>> [further standard error text snipped...]
>>>>
>>>> Below is some information about the host that raised the error and
>>>> the peer to which it was connected:
>>>>
>>>>   Local host:   compute-c3-07
>>>>   Local device: mthca0
>>>>   Peer host:    compute-c3-01
>>>>
>>>> You may need to consult with your system administrator to get this
>>>> problem fixed.
>>>> --------------------------------------------------------------------------
>>>>
>>>> /////////////////////////////////
>>>>
>>>> During these job runs, I have monitored the InfiniBand performance
>>>> counters on the endpoints and switch. No telltale counters for any
>>>> of these ports change during these failed job initiations.
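For reference, the "status number 13" in the handle_wc message above is
IBV_WC_RNR_RETRY_EXC_ERR in the verbs completion-status enum, and the
"vendor error" field (135 here) is device/firmware-specific. A minimal
sketch of how such a completion gets picked up when polling the CQ --
a generic illustration with a hypothetical drain_cq() helper, not the
actual openib BTL code:

#include <inttypes.h>
#include <stdio.h>
#include <infiniband/verbs.h>

/* Hypothetical helper: drain a CQ and report any failed completions. */
static void drain_cq(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int rc;

    while ((rc = ibv_poll_cq(cq, 1, &wc)) > 0) {
        if (wc.status != IBV_WC_SUCCESS) {
            /* The failure in this thread shows up here as
             * wc.status == 13 == IBV_WC_RNR_RETRY_EXC_ERR. */
            fprintf(stderr,
                    "wr_id %" PRIx64 " failed: %s (status %d, "
                    "vendor error %u)\n",
                    wc.wr_id, ibv_wc_status_str(wc.status),
                    (int) wc.status, wc.vendor_err);
        }
    }
    if (rc < 0)
        fprintf(stderr, "ibv_poll_cq failed: %d\n", rc);
}

The wr_id in the completion (2a25c200 above) just echoes whatever the
sender stashed in the work request, which is how a consumer like the
BTL can map the failure back to a particular fragment.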
>>>> ibdiagnet works fine and properly enumerates the fabric and
>>>> related performance counters, both from the affected nodes and from
>>>> other nodes attached to the IB switch. The IB connectivity itself
>>>> seems fine from these nodes.
>>>>
>>>> Other nodes with different HCAs use the same InfiniBand fabric
>>>> continuously without any issue, so I don't think it's the
>>>> fabric/switch.
>>>>
>>>> I'm at a loss for what to do next to try to find the root cause of
>>>> the issue. I suspect something having to do with the mthca
>>>> support/drivers, but how can I track this down further?
>>>>
>>>> Thank you,
>>>>
>>>> V. Ram

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/