Re: [OMPI users] OpenIB problems

2007-11-28 Thread Ogden, Jeffry Brandon
For what it's worth, Andrew, the RETRY_EXCEEDED_ERRORS can be caused by flaky hardware as well. The timeout value is probably best tuned relative to the size of your IB fabric. But if reliability is the biggest criterion, crank up the timeout value to 20. That's the best you can do. If it contin
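
To give a feel for what a timeout value of 20 means in wall-clock terms, here is a minimal sketch. It assumes the value discussed above is the InfiniBand Local ACK Timeout exponent (which Open MPI's openib BTL exposes, as far as I know, as the MCA parameter btl_openib_ib_timeout), where the per-try timeout is 4.096 microseconds times 2 to the power of the value. The function name is made up for illustration and is not part of Open MPI.

```python
# Assumption: the "timeout value" is the InfiniBand Local ACK Timeout
# exponent, where the per-try timeout is 4.096 us * 2**exponent.
IB_TIMEOUT_BASE_SECONDS = 4.096e-6


def ib_ack_timeout_seconds(exponent: int) -> float:
    """Return the per-try ACK timeout in seconds for a given exponent.

    Hypothetical helper for illustration only. The field is 5 bits wide;
    0 has a special meaning in the spec and is not handled here.
    """
    if not 1 <= exponent <= 31:
        raise ValueError("exponent must be in the range 1..31")
    return IB_TIMEOUT_BASE_SECONDS * 2 ** exponent


if __name__ == "__main__":
    for value in (10, 14, 20):
        print(f"timeout={value:2d} -> {ib_ack_timeout_seconds(value):.6f} s per try")
```

Under that assumption, a value of 20 works out to roughly 4.3 seconds per try, and the setting would typically be passed on the command line as something like `mpirun --mca btl_openib_ib_timeout 20 ...`. Check `ompi_info` on your installation to confirm the parameter name and default.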

Re: [OMPI users] OMPI launching problem using TM and openib on 1920 nodes

2006-10-20 Thread Ogden, Jeffry Brandon
bufsize=%d, buflen=%d, ct=%d)\n",
>
> Are you able to use OSC mpiexec to launch over the same number of
> nodes, perchance?
>
>
> On Oct 20, 2006, at 12:23 PM, Ogden, Jeffry Brandon wrote:
>
> > We are having quite a bit of trouble reliably launching larger

[OMPI users] OMPI launching problem using TM and openib on 1920 nodes

2006-10-20 Thread Ogden, Jeffry Brandon
We are having quite a bit of trouble reliably launching larger jobs (1920 nodes, 1 ppn) with OMPI (1.1.2rc4 with gcc) at the moment. The launches usually either just hang or fail with output like:
Cbench numprocs: 1920
Cbench numnodes: 1921
Cbench ppn: 1
Cbench jobname: xhpl-1ppn-1920
Cbench jobl

[OMPI users] Default number of slots when using Torque

2006-04-28 Thread Ogden, Jeffry Brandon
How does the orterun launch determine the default number of slots per node when running in a Torque job? Is there debug output from orterun that will show me this? Thanks.
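
One way to cross-check whatever orterun decides is to look at the allocation Torque itself reports. The sketch below does not show how orterun computes its slot count (the question above is still open on that point). It only counts how many times each host appears in the file named by $PBS_NODEFILE, which Torque writes with one line per allocated processor. The function names and the example host names are made up for illustration.

```python
import os
from collections import Counter


def slots_per_node(nodefile_lines):
    """Count how many times each host name appears in a Torque node file.

    Torque lists a host once per allocated processor, so the count is the
    number of processors the job was granted on that host.
    """
    return Counter(line.strip() for line in nodefile_lines if line.strip())


def slots_from_environment():
    """Read $PBS_NODEFILE if set; return an empty Counter outside a Torque job."""
    path = os.environ.get("PBS_NODEFILE")
    if path is None:
        return Counter()
    with open(path) as handle:
        return slots_per_node(handle)


if __name__ == "__main__":
    # Inside a Torque job this prints one line per allocated host.
    for host, slots in sorted(slots_from_environment().items()):
        print(f"{host}: {slots} slot(s)")
```

Run inside a job script, this prints the per-host processor counts, which can then be compared against what orterun actually launches.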