On Sun, Dec 18, 2011, at 11:39 AM, Yevgeny Kliteynik wrote:
> On 16-Dec-11 4:28 AM, Jeff Squyres wrote:
> > Very strange.  I have a lot of older mthca-based HCAs in my Cisco MPI test 
> > cluster, and I don't see these kinds of problems.
> > 
> > Mellanox -- any ideas?
> 
> So if I understand it right, you have a mixed cluster - some
> machines with ConnectX family HCAs (mlx4), and some with InfiniHost
> HCAs (mthca), and the problem arises only on machines with mthca.

Yes.

> When exactly do you see this RNR problem:
>  - when all the participating nodes are mthcas?

Yes.

>  - when the MPI job runs on both types of HCAs?

We have actually seen the same problem as the user who sent the
following:
  http://www.open-mpi.org/community/lists/users/2011/06/16773.php
so we don't bother trying to run jobs on heterogeneous hardware.

Our Slurm partitions are defined by hardware type, and we do not allow
users to run jobs across different hardware types using InfiniBand.  If
they want to run embarrassingly parallel jobs across different hardware
types, we mandate that they use Ethernet only (which does work as
expected).
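
As a rough illustration of that policy (this is not our actual Slurm
configuration), the rule amounts to something like the Python sketch
below. The NODE_HCA mapping, the node name compute-d1-01, and the
allowed_transport helper are all hypothetical and exist only for this
example.

```python
# Hypothetical mapping of node name -> HCA driver family.
# compute-d1-01 is an invented name; this is not real site data.
NODE_HCA = {
    "compute-c3-01": "mthca",
    "compute-c3-07": "mthca",
    "compute-d1-01": "mlx4",
}


def allowed_transport(nodes, node_hca=NODE_HCA):
    """Return "infiniband" if all nodes share one HCA family, else "ethernet"."""
    # Collect the distinct HCA families used by the requested nodes.
    families = {node_hca[name] for name in nodes}
    # Exactly one family means homogeneous hardware, so InfiniBand is allowed.
    return "infiniband" if len(families) == 1 else "ethernet"
```

A job on compute-c3-01 and compute-c3-07 would be allowed InfiniBand
under this sketch, while a job that also includes the hypothetical
mlx4 node would be restricted to Ethernet.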

Thank you,

V. Ram

> -- YK
> 
>  
> > 
> > On Dec 15, 2011, at 7:24 PM, V. Ram wrote:
> > 
> >> Hi Terry,
> >>
> >> Thanks so much for the response.  My replies are in-line below.
> >>
> >> On Thu, Dec 15, 2011, at 07:00 AM, TERRY DONTJE wrote:
> >>> IIRC, RNRs are usually due to the receiving side not having a segment
> >>> registered and ready to receive data on a QP.  The btl does go through a
> >>> big dance and does its own flow control to make sure this doesn't happen.
> >>>
> >>> So when this happens, are both the sending and receiving nodes using
> >>> mthcas to communicate?
> >>
> >> Yes.  For the newer nodes using onboard mlx4, this issue doesn't arise.
> >> The mlx4-based nodes are using the same core switch as the mthca nodes.
> >>
> >>> By any chance is it a particular node (or pair of nodes) this seems to
> >>> happen with?
> >>
> >> No.  I've got 40 nodes total with this hardware configuration, and the
> >> problem has been seen on most/all nodes at one time or another.  It
> >> doesn't seem, based on the limited number of observable parameters I'm
> >> aware of, to be dependent on the number of nodes involved.
> >>
> >> It is an intermittent problem, but when it happens, it happens at job
> >> launch, and it does occur most of the time.
> >>
> >> Thanks,
> >>
> >> V. Ram
> >>
> >>> --td
> >>>>
> >>>> Open MPI InfiniBand gurus and/or Mellanox: could I please get some
> >>>> assistance with this? Any suggestions on tunables or debugging
> >>>> parameters to try?
> >>>>
> >>>> Thank you very much.
> >>>>
> >>>> On Mon, Dec 12, 2011, at 10:42 AM, V. Ram wrote:
> >>>>> Hello,
> >>>>>
> >>>>> We are running a cluster that has a good number of older nodes with
> >>>>> Mellanox IB HCAs that have the "mthca" device name ("ib_mthca" kernel
> >>>>> module).
> >>>>>
> >>>>> These adapters are all at firmware level 4.8.917.
> >>>>>
> >>>>> The Open MPI in use is 1.5.3, kernel 2.6.39, x86-64. Jobs are
> >>>>> launched/managed using Slurm 2.2.7. The IB software and drivers
> >>>>> correspond to OFED 1.5.3.2, and I've verified that the kernel modules
> >>>>> in use are all from this OFED version.
> >>>>>
> >>>>> On nodes with the mthca hardware *only*, we get frequent, but
> >>>>> intermittent job startup failures, with messages like:
> >>>>>
> >>>>> /////////////////////////////////
> >>>>>
> >>>>> [[19373,1],54][btl_openib_component.c:3320:handle_wc] from compute-c3-07
> >>>>> to: compute-c3-01 error polling LP CQ with status RECEIVER NOT READY
> >>>>> RETRY EXCEEDED ERROR status
> >>>>> number 13 for wr_id 2a25c200 opcode 128 vendor error 135 qp_idx 0
> >>>>>
> >>>>> --------------------------------------------------------------------------
> >>>>> The OpenFabrics "receiver not ready" retry count on a per-peer
> >>>>> connection between two MPI processes has been exceeded. In general,
> >>>>> this should not happen because Open MPI uses flow control on per-peer
> >>>>> connections to ensure that receivers are always ready when data is
> >>>>> sent.
> >>>>>
> >>>>> [further standard error text snipped...]
> >>>>>
> >>>>> Below is some information about the host that raised the error and the
> >>>>> peer to which it was connected:
> >>>>>
> >>>>> Local host: compute-c3-07
> >>>>> Local device: mthca0
> >>>>> Peer host: compute-c3-01
> >>>>>
> >>>>> You may need to consult with your system administrator to get this
> >>>>> problem fixed.
> >>>>> --------------------------------------------------------------------------
> >>>>>
> >>>>> /////////////////////////////////
> >>>>>
> >>>>> During these job runs, I have monitored the InfiniBand performance
> >>>>> counters on the endpoints and switch. No telltale counters for any of
> >>>>> these ports change during these failed job initiations.
> >>>>>
> >>>>> ibdiagnet works fine and properly enumerates the fabric and related
> >>>>> performance counters, both from the affected nodes, as well as other
> >>>>> nodes attached to the IB switch. The IB connectivity itself seems fine
> >>>>> from these nodes.
> >>>>>
> >>>>> Other nodes with different HCAs use the same InfiniBand fabric
> >>>>> continuously without any issue, so I don't think it's the fabric/switch.
> >>>>>
> >>>>> I'm at a loss for what to do next to try and find the root cause of the
> >>>>> issue. I suspect something perhaps having to do with the mthca
> >>>>> support/drivers, but how can I track this down further?
> >>>>>
> >>>>> Thank you,
> >>>>>
> >>>>> V. Ram.
> >>
> >> _______________________________________________
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> > 
> > 
> 
> 

