Very strange.  I have a lot of older mthca-based HCAs in my Cisco MPI test 
cluster, and I don't see these kinds of problems.

Mellanox -- any ideas?


On Dec 15, 2011, at 7:24 PM, V. Ram wrote:

> Hi Terry,
> 
> Thanks so much for the response.  My replies are in-line below.
> 
> On Thu, Dec 15, 2011, at 07:00 AM, Terry Dontje wrote:
>> IIRC, RNRs are usually due to the receiving side not having a segment
>> registered and ready to receive data on a QP.  The btl does go through a
>> big dance and does its own flow control to make sure this doesn't happen.
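
A quick aside on the RNR mechanics Terry describes: at the verbs level, the
"receiver not ready" behavior is governed by two QP attributes set when the
connection is brought up.  The sketch below is generic libibverbs code, *not*
the openib BTL's actual connection code, and it assumes the caller has already
filled in the addressing/PSN fields of 'attr':

    /* Generic verbs sketch (not Open MPI code).  An RNR NAK is what the
     * responder sends when a message arrives and no receive WQE is posted;
     * min_rnr_timer and rnr_retry control how the requester reacts. */
    #include <infiniband/verbs.h>

    static int bring_qp_up(struct ibv_qp *qp, struct ibv_qp_attr *attr)
    {
        int rc;

        /* RTR (responder side): min_rnr_timer tells the peer how long to
         * back off before retrying after an RNR NAK. */
        attr->qp_state      = IBV_QPS_RTR;
        attr->min_rnr_timer = 12;            /* example value */
        rc = ibv_modify_qp(qp, attr,
                           IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                           IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                           IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
        if (rc) return rc;

        /* RTS (requester side): rnr_retry is how many times the HCA retries
         * after RNR NAKs before completing the send with
         * IBV_WC_RNR_RETRY_EXC_ERR (7 means "retry forever"). */
        attr->qp_state  = IBV_QPS_RTS;
        attr->rnr_retry = 7;                 /* example value */
        return ibv_modify_qp(qp, attr,
                             IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                             IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                             IBV_QP_MAX_QP_RD_ATOMIC);
    }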
>> 
>> So when this happens, are both the sending and receiving nodes using
>> mthcas to communicate with each other?
> 
> Yes.  For the newer nodes using onboard mlx4, this issue doesn't arise. 
> The mlx4-based nodes are using the same core switch as the mthca nodes.
> 
>> By any chance is it a particular node (or pair of nodes) this seems to 
>> happen with?
> 
> No.  I've got 40 nodes total with this hardware configuration, and the
> problem has been seen on most, if not all, of them at one time or another.
> Based on the limited set of parameters I can observe, it doesn't seem to
> depend on the number of nodes involved in the job.
> 
> The problem is intermittent, but when it does occur, it occurs at job
> launch, and it shows up on most launches.
> 
> Thanks,
> 
> V. Ram
> 
>> --td
>>> 
>>> Open MPI InfiniBand gurus and/or Mellanox: could I please get some
>>> assistance with this? Any suggestions on tunables or debugging
>>> parameters to try?
>>> 
>>> Thank you very much.
>>> 
>>> On Mon, Dec 12, 2011, at 10:42 AM, V. Ram wrote:
>>>> Hello,
>>>> 
>>>> We are running a cluster that has a good number of older nodes with
>>>> Mellanox IB HCAs that have the "mthca" device name ("ib_mthca" kernel
>>>> module).
>>>> 
>>>> These adapters are all at firmware level 4.8.917.
>>>> 
>>>> The Open MPI version in use is 1.5.3, kernel 2.6.39, x86-64. Jobs are
>>>> launched/managed using Slurm 2.2.7. The IB software and drivers
>>>> correspond to OFED 1.5.3.2, and I've verified that the kernel modules
>>>> in use are all from this OFED version.
>>>> 
>>>> On nodes with the mthca hardware *only*, we get frequent but
>>>> intermittent job startup failures, with messages like:
>>>> 
>>>> /////////////////////////////////
>>>> 
>>>> [[19373,1],54][btl_openib_component.c:3320:handle_wc] from compute-c3-07
>>>> to: compute-c3-01 error polling LP CQ with status RECEIVER NOT READY
>>>> RETRY EXCEEDED ERROR status
>>>> number 13 for wr_id 2a25c200 opcode 128 vendor error 135 qp_idx 0
>>>> 
>>>> --------------------------------------------------------------------------
>>>> The OpenFabrics "receiver not ready" retry count on a per-peer
>>>> connection between two MPI processes has been exceeded. In general,
>>>> this should not happen because Open MPI uses flow control on per-peer
>>>> connections to ensure that receivers are always ready when data is
>>>> sent.
>>>> 
>>>> [further standard error text snipped...]
>>>> 
>>>> Below is some information about the host that raised the error and the
>>>> peer to which it was connected:
>>>> 
>>>> Local host: compute-c3-07
>>>> Local device: mthca0
>>>> Peer host: compute-c3-01
>>>> 
>>>> You may need to consult with your system administrator to get this
>>>> problem fixed.
>>>> --------------------------------------------------------------------------
>>>> 
>>>> /////////////////////////////////
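
FWIW, the "status number 13" in that handle_wc message is the numeric value of
IBV_WC_RNR_RETRY_EXC_ERR on the work completion.  Roughly speaking -- and this
is a generic verbs sketch, not the BTL's actual handle_wc() logic -- the error
surfaces like this when the CQ is polled:

    /* Generic CQ-polling sketch (not btl_openib code): wc.status 13 is
     * IBV_WC_RNR_RETRY_EXC_ERR, i.e. the peer kept returning RNR NAKs
     * until the QP's rnr_retry budget was exhausted. */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    static void drain_cq(struct ibv_cq *cq)
    {
        struct ibv_wc wc;

        while (ibv_poll_cq(cq, 1, &wc) > 0) {
            if (wc.status == IBV_WC_SUCCESS)
                continue;                      /* normal completion */
            fprintf(stderr, "wr_id %llx: %s (status %d, vendor err %u)\n",
                    (unsigned long long) wc.wr_id,
                    ibv_wc_status_str(wc.status),
                    (int) wc.status, wc.vendor_err);
            /* IBV_WC_RNR_RETRY_EXC_ERR lands here; Open MPI turns it into
             * the "receiver not ready" help message quoted above. */
        }
    }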
>>>> 
>>>> During these job runs, I have monitored the InfiniBand performance
>>>> counters on the endpoints and switch. No telltale counters for any of
>>>> these ports change during these failed job initiations.
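
If it's useful to watch those counters from the nodes themselves between
launches: the mthca driver exposes per-port counters under sysfs, and a quick
dump like the one below will print them all.  The path layout is what I'd
expect for mthca0 port 1 under OFED 1.5.x -- adjust device/port as needed
(perfquery from infiniband-diags reports the same counters):

    /* Quick sysfs counter dump; the exact set of counter files varies by
     * driver, so this just prints whatever the kernel exposes. */
    #include <dirent.h>
    #include <stdio.h>

    int main(void)
    {
        const char *dir = "/sys/class/infiniband/mthca0/ports/1/counters";
        struct dirent *de;
        char path[512], buf[64];
        FILE *f;
        DIR *d = opendir(dir);

        if (!d) { perror(dir); return 1; }
        while ((de = readdir(d)) != NULL) {
            if (de->d_name[0] == '.')
                continue;                       /* skip "." and ".." */
            snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
            if ((f = fopen(path, "r")) == NULL)
                continue;
            if (fgets(buf, sizeof(buf), f))
                printf("%-40s %s", de->d_name, buf);  /* value ends in \n */
            fclose(f);
        }
        closedir(d);
        return 0;
    }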
>>>> 
>>>> ibdiagnet works fine and properly enumerates the fabric and related
>>>> performance counters, both from the affected nodes and from other
>>>> nodes attached to the IB switch. The IB connectivity itself seems fine
>>>> from these nodes.
>>>> 
>>>> Other nodes with different HCAs use the same InfiniBand fabric
>>>> continuously without any issue, so I don't think it's the fabric/switch.
>>>> 
>>>> I'm at a loss for what to do next to try to find the root cause of the
>>>> issue. I suspect something related to the mthca support/drivers, but
>>>> how can I track this down further?
>>>> 
>>>> Thank you,
>>>> 
>>>> V. Ram.
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

