Hello.
Sorry for the delay in confirming the minimum load that would trigger
the RNR error; the holidays here were a significant interruption.
On Mon, Dec 19, 2011, at 03:30 PM, Yevgeny Kliteynik wrote:
> What's the smallest number of nodes that are needed to reproduce this
> problem? Does it happen with just two HCAs, one process per node?
Hello,
On Mon, Dec 19, 2011, at 03:30 PM, Yevgeny Kliteynik wrote:
> Hi,
>
> What's the smallest number of nodes that are needed to reproduce this
> problem? Does it happen with just two HCAs, one process per node?
I believe so, but I will work with some users to verify this.
> Let's get you to
Hi,
> By any chance is it a particular node (or pair of nodes) this seems to
> happen with?
No. I've got 40 nodes total with this hardware configuration, and the
problem has been seen on most/all nodes at one time or another. It
doesn't seem, based on the limited numb
On 12/19/2011 2:10 AM, V. Ram wrote:
> On Thu, Dec 15, 2011, at 09:28 PM, Jeff Squyres wrote:
> > Very strange. I have a lot of older mthca-based HCAs in my Cisco MPI
> > test cluster, and I don't see these kinds of problems.
> Can I ask what version of OFED you're using, or what version of OFED the
> IB software stack is coming from?
On Thu, Dec 15, 2011, at 09:28 PM, Jeff Squyres wrote:
> Very strange. I have a lot of older mthca-based HCAs in my Cisco MPI
> test cluster, and I don't see these kinds of problems.
Can I ask what version of OFED you're using, or what version of OFED the
IB software stack is coming from?
Thank you.
On Sun, Dec 18, 2011, at 11:39 AM, Yevgeny Kliteynik wrote:
> On 16-Dec-11 4:28 AM, Jeff Squyres wrote:
> > Very strange. I have a lot of older mthca-based HCAs in my Cisco MPI test
> > cluster, and I don't see these kinds of problems.
> >
> > Mellanox -- any ideas?
>
> So if I understand it ri
On 16-Dec-11 4:28 AM, Jeff Squyres wrote:
> Very strange. I have a lot of older mthca-based HCAs in my Cisco MPI test
> cluster, and I don't see these kinds of problems.
>
> Mellanox -- any ideas?
So if I understand it right, you have a mixed cluster - some
machines with ConnectX family HCAs (ml
Very strange. I have a lot of older mthca-based HCAs in my Cisco MPI test
cluster, and I don't see these kinds of problems.
Mellanox -- any ideas?
On Dec 15, 2011, at 7:24 PM, V. Ram wrote:
> Hi Terry,
>
> Thanks so much for the response. My replies are in-line below.
>
> On Thu, Dec 15, 2011, at 07:00 AM, TERRY DONTJE wrote:
Hi Terry,
Thanks so much for the response. My replies are in-line below.
On Thu, Dec 15, 2011, at 07:00 AM, TERRY DONTJE wrote:
> IIRC, RNRs are usually due to the receiving side not having a segment
> registered and ready to receive data on a QP. The btl does go through a
> big dance and does its own flow control to make sure this doesn't happen.
IIRC, RNRs are usually due to the receiving side not having a segment
registered and ready to receive data on a QP. The btl does go through a
big dance and does its own flow control to make sure this doesn't happen.
So when this happens are both the sending and receiving nodes using
mthca's
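For background on the mechanism Terry describes: an RNR ("receiver not ready") NAK is generated when a SEND arrives on a reliable-connected QP and the responder has no receive work request posted. How long the sender backs off and how many times it retries before giving up are ordinary verbs-level QP attributes. The fragment below is a minimal sketch at the raw libibverbs level (it is not Open MPI's openib BTL code, and the attribute values are placeholders) showing where those two knobs, min_rnr_timer and rnr_retry, are set when a QP is brought up:

/* Illustrative only: NOT Open MPI's openib BTL code.  If a SEND arrives
 * on an RC QP and the responder has no receive WQE posted, the responder
 * returns an RNR NAK; the requester backs off (per the min_rnr_timer the
 * responder advertised) and retries up to rnr_retry times before the QP
 * drops into the error state with an RNR-retry-exceeded status.
 */
#include <infiniband/verbs.h>

/* Responder side: min_rnr_timer is supplied in the RTR transition.
 * (12 is just a placeholder IB-encoded backoff value.) */
static int bring_to_rtr(struct ibv_qp *qp, uint32_t dest_qpn,
                        uint16_t dlid, uint8_t port)
{
    struct ibv_qp_attr attr = {
        .qp_state           = IBV_QPS_RTR,
        .path_mtu           = IBV_MTU_1024,
        .dest_qp_num        = dest_qpn,
        .rq_psn             = 0,
        .max_dest_rd_atomic = 1,
        .min_rnr_timer      = 12,
        .ah_attr            = { .dlid = dlid, .port_num = port },
    };

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                         IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                         IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
}

/* Requester side: rnr_retry is supplied in the RTS transition.
 * 7 means "retry forever"; smaller values can eventually surface as the
 * kind of RNR error discussed in this thread. */
static int bring_to_rts(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr = {
        .qp_state      = IBV_QPS_RTS,
        .timeout       = 14,
        .retry_cnt     = 7,
        .rnr_retry     = 7,
        .sq_psn        = 0,
        .max_rd_atomic = 1,
    };

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                         IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                         IBV_QP_MAX_QP_RD_ATOMIC);
}

If memory serves, the openib BTL exposes its own settings for these through MCA parameters; "ompi_info --param btl openib" on the installed 1.5.3 should list the exact names and current values.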
Open MPI InfiniBand gurus and/or Mellanox: could I please get some
assistance with this? Any suggestions on tunables or debugging
parameters to try?
Thank you very much.
On Mon, Dec 12, 2011, at 10:42 AM, V. Ram wrote:
> Hello,
>
> We are running a cluster that has a good number of older nodes
Hello,
We are running a cluster that has a good number of older nodes with
Mellanox IB HCAs that have the "mthca" device name ("ib_mthca" kernel
module).
These adapters are all at firmware level 4.8.917.
The Open MPI in use is 1.5.3, kernel 2.6.39, x86-64. Jobs are
launched/managed using Slurm.
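One way to double-check which adapters and firmware levels are actually in play on each node (a suggestion, not something already established in this thread) is a tiny libibverbs program that lists every HCA with its verbs device name and firmware string; ibv_devinfo from the libibverbs package reports the same information. A sketch, assuming the libibverbs development headers are installed:

/* hca_info.c: print each HCA's verbs device name (mthca0, mlx4_0, ...)
 * and its firmware version string, so a mixed cluster can be compared
 * node by node.  Build with:  gcc -o hca_info hca_info.c -libverbs
 */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0, i;
    struct ibv_device **list = ibv_get_device_list(&num);

    if (!list) {
        perror("ibv_get_device_list");
        return 1;
    }

    for (i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(list[i]);
        struct ibv_device_attr attr;

        if (!ctx)
            continue;
        if (ibv_query_device(ctx, &attr) == 0)
            printf("%-10s firmware %s\n",
                   ibv_get_device_name(list[i]), attr.fw_ver);
        ibv_close_device(ctx);
    }

    ibv_free_device_list(list);
    return 0;
}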