On 16-Dec-11 4:28 AM, Jeff Squyres wrote:
> Very strange. I have a lot of older mthca-based HCAs in my Cisco MPI test
> cluster, and I don't see these kinds of problems.
>
> Mellanox -- any ideas?
So if I understand it right, you have a mixed cluster - some machines with
ConnectX-family HCAs (mlx4) and some with InfiniHost HCAs (mthca) - and the
problem arises only on the machines with mthca.

When exactly do you see this RNR problem:
- when all the participating nodes are mthcas?
- when the MPI job runs on both types of HCAs?

(Some notes on the verbs-level RNR parameters, and a quick port-state check
that can be run on the affected nodes, are at the bottom of this mail.)

-- YK

> On Dec 15, 2011, at 7:24 PM, V. Ram wrote:
>
>> Hi Terry,
>>
>> Thanks so much for the response. My replies are in-line below.
>>
>> On Thu, Dec 15, 2011, at 07:00 AM, TERRY DONTJE wrote:
>>> IIRC, RNRs are usually due to the receiving side not having a segment
>>> registered and ready to receive data on a QP. The btl does go through
>>> a big dance and does its own flow control to make sure this doesn't
>>> happen.
>>>
>>> So when this happens, are both the sending and receiving nodes using
>>> mthcas to communicate?
>>
>> Yes. For the newer nodes using onboard mlx4, this issue doesn't arise.
>> The mlx4-based nodes are using the same core switch as the mthca nodes.
>>
>>> By any chance is it a particular node (or pair of nodes) this seems to
>>> happen with?
>>
>> No. I've got 40 nodes total with this hardware configuration, and the
>> problem has been seen on most/all nodes at one time or another. It
>> doesn't seem, based on the limited number of observable parameters I'm
>> aware of, to be dependent on the number of nodes involved.
>>
>> It is an intermittent problem, but when it happens, it happens at job
>> launch, and it does occur most of the time.
>>
>> Thanks,
>>
>> V. Ram
>>
>>> --td
>>>>
>>>> Open MPI InfiniBand gurus and/or Mellanox: could I please get some
>>>> assistance with this? Any suggestions on tunables or debugging
>>>> parameters to try?
>>>>
>>>> Thank you very much.
>>>>
>>>> On Mon, Dec 12, 2011, at 10:42 AM, V. Ram wrote:
>>>>> Hello,
>>>>>
>>>>> We are running a cluster that has a good number of older nodes with
>>>>> Mellanox IB HCAs that have the "mthca" device name ("ib_mthca"
>>>>> kernel module).
>>>>>
>>>>> These adapters are all at firmware level 4.8.917.
>>>>>
>>>>> The Open MPI in use is 1.5.3, kernel 2.6.39, x86-64. Jobs are
>>>>> launched/managed using Slurm 2.2.7. The IB software and drivers
>>>>> correspond to OFED 1.5.3.2, and I've verified that the kernel
>>>>> modules in use are all from this OFED version.
>>>>>
>>>>> On nodes with the mthca hardware *only*, we get frequent but
>>>>> intermittent job startup failures, with messages like:
>>>>>
>>>>> /////////////////////////////////
>>>>>
>>>>> [[19373,1],54][btl_openib_component.c:3320:handle_wc] from
>>>>> compute-c3-07 to: compute-c3-01 error polling LP CQ with status
>>>>> RECEIVER NOT READY RETRY EXCEEDED ERROR status number 13 for
>>>>> wr_id 2a25c200 opcode 128 vendor error 135 qp_idx 0
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> The OpenFabrics "receiver not ready" retry count on a per-peer
>>>>> connection between two MPI processes has been exceeded. In general,
>>>>> this should not happen because Open MPI uses flow control on
>>>>> per-peer connections to ensure that receivers are always ready when
>>>>> data is sent.
>>>>>
>>>>> [further standard error text snipped...]
>>>>>
>>>>> Below is some information about the host that raised the error and
>>>>> the peer to which it was connected:
>>>>>
>>>>>   Local host:   compute-c3-07
>>>>>   Local device: mthca0
>>>>>   Peer host:    compute-c3-01
>>>>>
>>>>> You may need to consult with your system administrator to get this
>>>>> problem fixed.
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> /////////////////////////////////
>>>>>
>>>>> During these job runs, I have monitored the InfiniBand performance
>>>>> counters on the endpoints and switch. No telltale counters for any
>>>>> of these ports change during these failed job initiations.
>>>>>
>>>>> ibdiagnet works fine and properly enumerates the fabric and related
>>>>> performance counters, both from the affected nodes and from other
>>>>> nodes attached to the IB switch. The IB connectivity itself seems
>>>>> fine from these nodes.
>>>>>
>>>>> Other nodes with different HCAs use the same InfiniBand fabric
>>>>> continuously without any issue, so I don't think it's the
>>>>> fabric/switch.
>>>>>
>>>>> I'm at a loss for what to do next to try to find the root cause of
>>>>> the issue. I suspect it may have something to do with the mthca
>>>>> support/drivers, but how can I track this down further?
>>>>>
>>>>> Thank you,
>>>>>
>>>>> V. Ram.
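
A note on what the error means at the verbs level: "RECEIVER NOT READY
RETRY EXCEEDED ERROR" with status number 13 is IBV_WC_RNR_RETRY_EXC_ERR,
the completion status the sender's HCA reports when the peer keeps
answering with RNR NAKs until the QP's rnr_retry budget is used up. Below
is a minimal sketch - not Open MPI's actual connection code - just to show
which QP attributes are involved when an RC queue pair is moved to RTS;
all numeric values are placeholders.

/*
 * Minimal sketch, NOT Open MPI's actual connection code: it only shows
 * where the RNR-related attributes live in the verbs API.  All numeric
 * values are placeholders.
 */
#include <string.h>
#include <infiniband/verbs.h>

/* Final step of connecting an RC queue pair: the RTR -> RTS transition. */
int sketch_move_qp_to_rts(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr;
    memset(&attr, 0, sizeof(attr));

    attr.qp_state      = IBV_QPS_RTS;
    attr.timeout       = 14;  /* ACK timeout exponent (placeholder)       */
    attr.retry_cnt     = 7;   /* retries on transport timeouts            */
    attr.rnr_retry     = 6;   /* retries after RNR NAKs; 7 means infinite.
                               * A finite value that runs out produces the
                               * IBV_WC_RNR_RETRY_EXC_ERR completion seen
                               * as "status number 13" in the log.        */
    attr.sq_psn        = 0;   /* placeholder starting PSN                 */
    attr.max_rd_atomic = 1;   /* outstanding RDMA reads/atomics           */

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE     | IBV_QP_TIMEOUT   |
                         IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY |
                         IBV_QP_SQ_PSN    | IBV_QP_MAX_QP_RD_ATOMIC);
}

The companion attribute, min_rnr_timer, is set on the responder side
during the INIT -> RTR transition and controls how long the sender backs
off before each RNR retry. The fact that the error fires at all means the
QPs are set up with a finite rnr_retry. If I remember the parameter names
right, the openib BTL exposes these knobs as MCA parameters (something
like btl_openib_ib_rnr_retry and btl_openib_ib_min_rnr_timer; "ompi_info
-a" on your build will show the exact names and defaults), so they can be
raised as an experiment - though, as the help text says, the BTL's flow
control is supposed to make that unnecessary, so a driver/firmware issue
on the mthca side is still worth chasing.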
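
Since the port counters and ibdiagnet already look clean, one other quick
check at the verbs layer is to confirm, from an affected node, that
mthca0's port is ACTIVE with a sane LID and MTU. A minimal sketch follows
(it assumes the HCAs use port 1 and builds with something like
"cc -o check_ports check_ports.c -libverbs"; the file name is just an
example):

/*
 * Minimal sketch: list the local HCAs and the state of port 1 on each.
 * Port number 1 is an assumption; adjust if the HCAs are dual-port.
 */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int n = 0;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (devs == NULL) {
        perror("ibv_get_device_list");
        return 1;
    }

    for (int i = 0; i < n; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        if (ctx == NULL)
            continue;

        struct ibv_port_attr port;
        if (ibv_query_port(ctx, 1, &port) == 0) {
            /* active_mtu is an enum: 3 = 1024, 4 = 2048, 5 = 4096 bytes */
            printf("%s port 1: state=%s lid=%u active_mtu=%d\n",
                   ibv_get_device_name(devs[i]),
                   ibv_port_state_str(port.state),
                   (unsigned)port.lid, port.active_mtu);
        }
        ibv_close_device(ctx);
    }

    ibv_free_device_list(devs);
    return 0;
}

If that reports PORT_ACTIVE for mthca0 while the RNR failures continue, it
points away from the fabric and toward the mthca driver/firmware path,
which matches what V. Ram is already suspecting.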