Re: [OMPI users] random IB failures when running medium core counts

2010-08-30 Thread Joshua Bernstein
Hello Brock,

While it doesn't solve the problem, have you tried increasing the btl timeouts as the message suggests? With 1884 cores in use, perhaps there is some oversubscription in the fabric?

-Joshua Bernstein
Penguin Computing

Brock Palen wrote: We recently installed a modest IB networ
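
For reference, here is a minimal sketch of what raising the timeout means numerically. It assumes the timeout in question is the openib BTL's `btl_openib_ib_timeout` MCA parameter (the error text is not shown in this excerpt, so that is a guess), and that the value is the exponent in the InfiniBand local ACK timeout formula, 4.096 microseconds * 2^value. The function name and the sample exponents are illustrative only and are not taken from the thread.

```python
def ib_ack_timeout_seconds(exponent: int) -> float:
    """Return the local ACK timeout in seconds for a given exponent.

    Assumption: the timeout follows 4.096 us * 2**exponent, with the
    exponent stored in a 5-bit field. The value 0 is excluded here because
    it is treated specially rather than by this formula.
    """
    if not 1 <= exponent <= 31:
        raise ValueError("exponent must be in the range 1..31")
    return 4.096e-6 * (2 ** exponent)


if __name__ == "__main__":
    # Illustrative exponents only; pick values appropriate to the fabric.
    for exponent in (10, 14, 20):
        seconds = ib_ack_timeout_seconds(exponent)
        print(f"exponent {exponent:2d} -> {seconds:.6f} s per retry")
```

Each step up in the exponent doubles the wait before a retry, so small increases have a large effect. If the assumption about the parameter name holds, it could be set at launch with something like `mpirun --mca btl_openib_ib_timeout 20 ...`, where the value 20 is only an example.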

[OMPI users] random IB failures when running medium core counts

2010-08-30 Thread Brock Palen
We recently installed a modest IB network on our cluster. When running a 1884-core IB HPL job, we sometimes get an IB error after a run. It does not always happen in the same place: some iterations pass and others fail. The error is below. We are using openmpi/1.4.2 with the intel 11 compi