For what it's worth, Andrew, the RETRY_EXCEEDED_ERRORS can be caused by
flaky hardware as well. The timeout value is probably best tuned
relative to the size of your IB fabric. But if reliability is the
biggest criterion, crank up the timeout value to 20. That's the best
you can do. If it contin
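
To give a feel for what a timeout value of 20 means in wall-clock terms, here is a rough sketch. It assumes the value discussed above is the InfiniBand local ACK timeout exponent, where the actual wait is 4.096 microseconds * 2^value. The thread does not name the parameter, so treating it as that exponent (in Open MPI, likely the openib BTL's btl_openib_ib_timeout) is an assumption, and the helper name ib_timeout_seconds is made up for illustration.

```python
# Sketch only: convert an InfiniBand timeout exponent into seconds.
# Assumption: the "timeout value" is the IB local ACK timeout exponent,
# so the wait per attempt is 4.096 microseconds * 2**value.

BASE_SECONDS = 4.096e-6  # 4.096 microseconds


def ib_timeout_seconds(value: int) -> float:
    """Return the per-attempt ACK timeout in seconds for a given exponent."""
    # The exponent is a 5-bit field; 0 is a special case (no timeout),
    # so only 1..31 are converted here.
    if not 1 <= value <= 31:
        raise ValueError("timeout exponent must be in the range 1..31")
    return BASE_SECONDS * (2 ** value)


if __name__ == "__main__":
    # Print a few exponents side by side to show how fast the wait grows.
    for value in (10, 14, 18, 20):
        print(f"timeout={value:2d} -> {ib_timeout_seconds(value):.6f} s")
```

Under that assumption, a value of 20 works out to roughly 4.3 seconds per attempt, and each step up doubles the wait, which is why the setting matters more as the fabric grows.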
bufsize=%d, buflen=%d, ct=%d)\n",
>
> Are you able to use OSC mpiexec to launch over the same number of
> nodes, perchance?
>
>
> On Oct 20, 2006, at 12:23 PM, Ogden, Jeffry Brandon wrote:
>
> > We are having quite a bit of trouble reliably launching larger
We are having quite a bit of trouble reliably launching larger jobs
(1920 nodes, 1 ppn) with OMPI (1.1.2rc4 with gcc) at the moment. The
launches usually either just hang or fail with output like:
Cbench numprocs: 1920
Cbench numnodes: 1921
Cbench ppn: 1
Cbench jobname: xhpl-1ppn-1920
Cbench jobl
How does the orterun launch determine the default number of slots per
node when running in a Torque job? Is there debug output from orterun
that will show me this?
Thanks.