On Thu, Dec 15, 2011, at 09:28 PM, Jeff Squyres wrote:
> Very strange.  I have a lot of older mthca-based HCAs in my Cisco MPI
> test cluster, and I don't see these kinds of problems.

Can I ask what version of OFED you're using, or which OFED release your IB
software stack comes from?
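
In case it helps, this is how I pull that information on our side (assuming
an OFED-style install that provides ofed_info; the sysfs device name below
matches our mthca nodes):

$ ofed_info -s                              # prints the installed OFED release string
$ modinfo ib_mthca | grep -i '^version'     # driver version, if the module exports one
$ cat /sys/class/infiniband/mthca0/fw_ver   # HCA firmware level via sysfs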

Thank you.

V. Ram

> On Dec 15, 2011, at 7:24 PM, V. Ram wrote:
> 
> > Hi Terry,
> > 
> > Thanks so much for the response.  My replies are in-line below.
> > 
> > On Thu, Dec 15, 2011, at 07:00 AM, TERRY DONTJE wrote:
> >> IIRC, RNRs are usually due to the receiving side not having a receive
> >> buffer registered and ready on a QP.  The openib BTL does go through a
> >> big dance and does its own flow control to make sure this doesn't happen.
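
(Side note: the flow-control / RNR-related knobs in the openib BTL can be
listed with ompi_info; the parameter names below are from our 1.5.3 build and
may differ in other releases.)

$ ompi_info --param btl openib | grep -i rnr
      # e.g. btl_openib_ib_rnr_retry, btl_openib_ib_min_rnr_timer
$ ompi_info --param btl openib | grep -i receive_queues
      # the per-peer / SRQ receive-queue layout that the BTL flow control manages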
> >> 
> >> So when this happens, are both the sending and receiving nodes using
> >> mthcas to communicate?
> > 
> > Yes.  For the newer nodes using onboard mlx4, this issue doesn't arise. 
> > The mlx4-based nodes are using the same core switch as the mthca nodes.
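
(For completeness, a quick way to double-check which HCA and firmware a given
node is actually using:)

$ ibv_devinfo | egrep 'hca_id|fw_ver'   # e.g. mthca0 vs. mlx4_0, plus firmware level
$ ibstat                                # per-port state, LID, and link rate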
> > 
> >> By any chance is it a particular node (or pair of nodes) this seems to 
> >> happen with?
> > 
> > No.  I've got 40 nodes total with this hardware configuration, and the
> > problem has been seen on most (if not all) of them at one time or another.
> > Based on the limited set of parameters I can observe, it doesn't seem to
> > depend on the number of nodes involved in a job.
> > 
> > The problem is intermittent, but when it does occur, it is at job launch,
> > and it occurs on the majority of launches.
> > 
> > Thanks,
> > 
> > V. Ram
> > 
> >> --td
> >>> 
> >>> Open MPI InfiniBand gurus and/or Mellanox: could I please get some
> >>> assistance with this? Any suggestions on tunables or debugging
> >>> parameters to try?
> >>> 
> >>> Thank you very much.
> >>> 
> >>> On Mon, Dec 12, 2011, at 10:42 AM, V. Ram wrote:
> >>>> Hello,
> >>>> 
> >>>> We are running a cluster that has a good number of older nodes with
> >>>> Mellanox IB HCAs that have the "mthca" device name ("ib_mthca" kernel
> >>>> module).
> >>>> 
> >>>> These adapters are all at firmware level 4.8.917.
> >>>> 
> >>>> The Open MPI in use is 1.5.3, kernel 2.6.39, x86-64. Jobs are
> >>>> launched/managed using Slurm 2.2.7. The IB software and drivers
> >>>> correspond to OFED 1.5.3.2, and I've verified that the kernel modules
> >>>> in use are all from this OFED version.
> >>>> 
> >>>> On nodes with the mthca hardware *only*, we get frequent but
> >>>> intermittent job startup failures, with messages like:
> >>>> 
> >>>> /////////////////////////////////
> >>>> 
> >>>> [[19373,1],54][btl_openib_component.c:3320:handle_wc] from compute-c3-07
> >>>> to: compute-c3-01 error polling LP CQ with status RECEIVER NOT READY
> >>>> RETRY EXCEEDED ERROR status
> >>>> number 13 for wr_id 2a25c200 opcode 128 vendor error 135 qp_idx 0
> >>>> 
> >>>> --------------------------------------------------------------------------
> >>>> The OpenFabrics "receiver not ready" retry count on a per-peer
> >>>> connection between two MPI processes has been exceeded. In general,
> >>>> this should not happen because Open MPI uses flow control on per-peer
> >>>> connections to ensure that receivers are always ready when data is
> >>>> sent.
> >>>> 
> >>>> [further standard error text snipped...]
> >>>> 
> >>>> Below is some information about the host that raised the error and the
> >>>> peer to which it was connected:
> >>>> 
> >>>> Local host: compute-c3-07
> >>>> Local device: mthca0
> >>>> Peer host: compute-c3-01
> >>>> 
> >>>> You may need to consult with your system administrator to get this
> >>>> problem fixed.
> >>>> --------------------------------------------------------------------------
> >>>> 
> >>>> /////////////////////////////////
> >>>> 
> >>>> During these job runs, I have monitored the InfiniBand performance
> >>>> counters on the endpoints and switch. No telltale counters for any of
> >>>> these ports change during these failed job initiations.
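
(For anyone following along: the counters referred to here can be watched with
the standard infiniband-diags tools, e.g.:)

$ perfquery                  # port counters on the local HCA
$ perfquery <lid> <port>     # the same query against a remote port, e.g. on the switch
$ ibcheckerrors              # sweep the fabric for ports with non-zero error counters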
> >>>> 
> >>>> ibdiagnet runs cleanly and properly enumerates the fabric and the
> >>>> related performance counters, both from the affected nodes and from
> >>>> other nodes attached to the IB switch. The IB connectivity itself seems
> >>>> fine from these nodes.
> >>>> 
> >>>> Other nodes with different HCAs use the same InfiniBand fabric
> >>>> continuously without any issue, so I don't think it's the fabric/switch.
> >>>> 
> >>>> I'm at a loss for what to do next to find the root cause of this
> >>>> issue. I suspect something related to the mthca support/drivers, but
> >>>> how can I track this down further?
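
(One thing I can try on the next occurrence is a verbose run so the openib BTL
logs its connection setup; something along these lines, where ./your_app stands
in for the real binary and the verbosity level is just a starting point:)

$ mpirun --mca btl openib,sm,self --mca btl_base_verbose 100 ./your_app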
> >>>> 
> >>>> Thank you,
> >>>> 
> >>>> V. Ram.
> > 
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
