Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes

2012-01-04 Thread V. Ram
Hello. Sorry for the delay in confirming the minimum load that would trigger the RnR error; the holidays here were a significant interruption. On Mon, Dec 19, 2011, at 03:30 PM, Yevgeny Kliteynik wrote: > What's the smallest number of nodes that are needed to reproduce this > problem? Does it ha

Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes

2011-12-20 Thread V. Ram
Hello, On Mon, Dec 19, 2011, at 03:30 PM, Yevgeny Kliteynik wrote: > Hi, > > What's the smallest number of nodes that are needed to reproduce this > problem? Does it happen with just two HCAs, one process per node? I believe so, but I will work with some users to verify this. > Let's get you to

Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes

2011-12-19 Thread Yevgeny Kliteynik
Hi, > By any chance is it a particular node (or pair of nodes) this seems to > happen with? No. I've got 40 nodes total with this hardware configuration, and the problem has been seen on most/all nodes at one time or another. It doesn't seem, based on the limited numb

Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes

2011-12-19 Thread TERRY DONTJE
On 12/19/2011 2:10 AM, V. Ram wrote: On Thu, Dec 15, 2011, at 09:28 PM, Jeff Squyres wrote: Very strange. I have a lot of older mthca-based HCAs in my Cisco MPI test cluster, and I don't see these kinds of problems. Can I ask what version of OFED you're using, or what version of OFED the IB so

Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes

2011-12-19 Thread V. Ram
On Thu, Dec 15, 2011, at 09:28 PM, Jeff Squyres wrote: > Very strange. I have a lot of older mthca-based HCAs in my Cisco MPI > test cluster, and I don't see these kinds of problems. Can I ask what version of OFED you're using, or what version of OFED the IB software stack is coming from? Thank

Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes

2011-12-19 Thread V. Ram
On Sun, Dec 18, 2011, at 11:39 AM, Yevgeny Kliteynik wrote: > On 16-Dec-11 4:28 AM, Jeff Squyres wrote: > > Very strange. I have a lot of older mthca-based HCAs in my Cisco MPI test > > cluster, and I don't see these kinds of problems. > > > > Mellanox -- any ideas? > > So if I understand it ri

Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes

2011-12-18 Thread Yevgeny Kliteynik
On 16-Dec-11 4:28 AM, Jeff Squyres wrote: > Very strange. I have a lot of older mthca-based HCAs in my Cisco MPI test > cluster, and I don't see these kinds of problems. > > Mellanox -- any ideas? So if I understand it right, you have a mixed cluster - some machines with ConnectX family HCAs (ml

Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes

2011-12-15 Thread Jeff Squyres
Very strange. I have a lot of older mthca-based HCAs in my Cisco MPI test cluster, and I don't see these kinds of problems. Mellanox -- any ideas? On Dec 15, 2011, at 7:24 PM, V. Ram wrote: > Hi Terry, > > Thanks so much for the response. My replies are in-line below. > > On Thu, Dec 15, 2

Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes

2011-12-15 Thread V. Ram
Hi Terry, Thanks so much for the response. My replies are in-line below. On Thu, Dec 15, 2011, at 07:00 AM, TERRY DONTJE wrote: > IIRC, RNR's are usually due to the receiving side not having a segment > registered and ready to receive data on a QP. The btl does go through a > big dance and do

Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes

2011-12-15 Thread TERRY DONTJE
IIRC, RNR's are usually due to the receiving side not having a segment registered and ready to receive data on a QP. The btl does go through a big dance and does its own flow control to make sure this doesn't happen. So when this happens are both the sending and receiving nodes using mthca's

Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes

2011-12-14 Thread V. Ram
Open MPI InfiniBand gurus and/or Mellanox: could I please get some assistance with this? Any suggestions on tunables or debugging parameters to try? Thank you very much. On Mon, Dec 12, 2011, at 10:42 AM, V. Ram wrote: > Hello, > > We are running a cluster that has a good number of older nodes

[OMPI users] Error launching w/ 1.5.3 on IB mthca nodes

2011-12-12 Thread V. Ram
Hello, We are running a cluster that has a good number of older nodes with Mellanox IB HCAs that have the "mthca" device name ("ib_mthca" kernel module). These adapters are all at firmware level 4.8.917. The Open MPI in use is 1.5.3, kernel 2.6.39, x86-64. Jobs are launched/managed using Slu