On Thu, Dec 15, 2011, at 09:28 PM, Jeff Squyres wrote:
> Very strange. I have a lot of older mthca-based HCAs in my Cisco MPI
> test cluster, and I don't see these kinds of problems.

Can I ask what version of OFED you're using, or what version of OFED the
IB software stack is coming from?
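For comparison, this is how I've been checking the stack on our side. A
quick sanity check only, assuming the stock OFED install layout (the
module name is ib_mthca on our nodes):

    # first line of ofed_info is the OFED release string
    ofed_info | head -1

    # confirm the loaded ib_mthca module is OFED's build, not the distro's
    modinfo ib_mthca | grep -Ei 'filename|version'

    # HCA firmware level and port state as libibverbs sees them
    ibv_devinfo | grep -Ei 'fw_ver|board_id|state'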
Thank you,

V. Ram

> On Dec 15, 2011, at 7:24 PM, V. Ram wrote:
>
> > Hi Terry,
> >
> > Thanks so much for the response. My replies are in-line below.
> >
> > On Thu, Dec 15, 2011, at 07:00 AM, TERRY DONTJE wrote:
> >> IIRC, RNR's are usually due to the receiving side not having a segment
> >> registered and ready to receive data on a QP. The btl does go through
> >> a big dance and does its own flow control to make sure this doesn't
> >> happen.
> >>
> >> So when this happens are both the sending and receiving nodes using
> >> mthca's to communicate with?
> >
> > Yes. For the newer nodes using onboard mlx4, this issue doesn't arise.
> > The mlx4-based nodes are using the same core switch as the mthca nodes.
> >
> >> By any chance is it a particular node (or pair of nodes) this seems to
> >> happen with?
> >
> > No. I've got 40 nodes total with this hardware configuration, and the
> > problem has been seen on most/all nodes at one time or another. It
> > doesn't seem, based on the limited number of observable parameters I'm
> > aware of, to be dependent on the number of nodes involved.
> >
> > It is an intermittent problem, but when it happens, it happens at job
> > launch, and it does occur most of the time.
> >
> > Thanks,
> >
> > V. Ram
> >
> >> --td
> >>>
> >>> Open MPI InfiniBand gurus and/or Mellanox: could I please get some
> >>> assistance with this? Any suggestions on tunables or debugging
> >>> parameters to try?
> >>>
> >>> Thank you very much.
> >>>
> >>> On Mon, Dec 12, 2011, at 10:42 AM, V. Ram wrote:
> >>>> Hello,
> >>>>
> >>>> We are running a cluster that has a good number of older nodes with
> >>>> Mellanox IB HCAs that have the "mthca" device name ("ib_mthca"
> >>>> kernel module).
> >>>>
> >>>> These adapters are all at firmware level 4.8.917.
> >>>>
> >>>> The Open MPI in use is 1.5.3, kernel 2.6.39, x86-64. Jobs are
> >>>> launched/managed using Slurm 2.2.7. The IB software and drivers
> >>>> correspond to OFED 1.5.3.2, and I've verified that the kernel
> >>>> modules in use are all from this OFED version.
> >>>>
> >>>> On nodes with the mthca hardware *only*, we get frequent, but
> >>>> intermittent job startup failures, with messages like:
> >>>>
> >>>> /////////////////////////////////
> >>>>
> >>>> [[19373,1],54][btl_openib_component.c:3320:handle_wc] from
> >>>> compute-c3-07 to: compute-c3-01 error polling LP CQ with status
> >>>> RECEIVER NOT READY RETRY EXCEEDED ERROR status number 13 for
> >>>> wr_id 2a25c200 opcode 128 vendor error 135 qp_idx 0
> >>>>
> >>>> --------------------------------------------------------------------------
> >>>> The OpenFabrics "receiver not ready" retry count on a per-peer
> >>>> connection between two MPI processes has been exceeded. In general,
> >>>> this should not happen because Open MPI uses flow control on
> >>>> per-peer connections to ensure that receivers are always ready when
> >>>> data is sent.
> >>>>
> >>>> [further standard error text snipped...]
> >>>>
> >>>> Below is some information about the host that raised the error and
> >>>> the peer to which it was connected:
> >>>>
> >>>>   Local host:   compute-c3-07
> >>>>   Local device: mthca0
> >>>>   Peer host:    compute-c3-01
> >>>>
> >>>> You may need to consult with your system administrator to get this
> >>>> problem fixed.
> >>>> --------------------------------------------------------------------------
> >>>>
> >>>> /////////////////////////////////
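(A note on the counter monitoring mentioned in my quoted report below:
this is roughly the loop I run on each endpoint while a job launches.
It is hand-rolled and illustrative only; -x asks perfquery for the
extended 64-bit counters on the local port.)

    while true; do
        date
        perfquery -x
        sleep 1
    done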
> >>>> During these job runs, I have monitored the InfiniBand performance
> >>>> counters on the endpoints and switch. No telltale counters for any
> >>>> of these ports change during these failed job initiations.
> >>>>
> >>>> ibdiagnet works fine and properly enumerates the fabric and related
> >>>> performance counters, both from the affected nodes, as well as
> >>>> other nodes attached to the IB switch. The IB connectivity itself
> >>>> seems fine from these nodes.
> >>>>
> >>>> Other nodes with different HCAs use the same InfiniBand fabric
> >>>> continuously without any issue, so I don't think it's the
> >>>> fabric/switch.
> >>>>
> >>>> I'm at a loss for what to do next to try and find the root cause of
> >>>> the issue. I suspect something perhaps having to do with the mthca
> >>>> support/drivers, but how can I track this down further?
> >>>>
> >>>> Thank you,
> >>>>
> >>>> V. Ram.
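P.S. For the archives, these are the knobs I plan to experiment with
next, in case it helps anyone else who hits this. The parameter names
are as reported by ompi_info for our 1.5.3 build; treat this as a
diagnostic starting point, not a known fix, and my_app / -np 64 are
just stand-ins for a real job:

    # list the openib BTL's retry/flow-control related parameters
    ompi_info --param btl openib | grep -Ei 'rnr|retry|receive_queues'

    # trial run with the RNR retry count raised to 7, which the IB spec
    # defines as "retry indefinitely", to see whether the failures
    # merely turn into hangs rather than vanish
    mpirun --mca btl_openib_ib_rnr_retry 7 -np 64 ./my_app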