On 14-Sep-11 12:59 PM, Jeff Squyres wrote:
> On Sep 13, 2011, at 6:33 PM, kevin.buck...@ecs.vuw.ac.nz wrote:
>
>> there have been two runs of jobs that invoked the mpirun using these
>> OpenMPI parameter setting flags (basically, these mimic what I have
>> in the global config file)
>>
>> -mca btl_openib_ib_timeout 20 -mca btl_openib_ib_min_rnr_timer 25
>>
>> when both of the jobs failed, the error output was
>>
>> -----8<----------8<----------8<----------8<----------8<-----
>> Two MCA parameters can be used to control Open MPI's behavior with
>> respect to the retry count:
>>
>> * btl_openib_ib_retry_count - The number of times the sender will
>> attempt to retry (defaulted to 7, the maximum value).
>> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
>> to 10). The actual timeout value used is calculated as:
>> -----8<----------8<----------8<----------8<----------8<-----
>>
>> Note that the error output is still showing that mysterious "10"
>> in there for btl_openib_ib_timeout value.
>
> That text message is hard-coded (and apparently out of date); it does not
> show the current value.
>
> I agree that that is misleading. This error message needs to be improved.

Indeed, this error message is out of date. It has the right value in OMPI 1.5 and trunk, but not in the 1.4 series.

-- YK

>> I have noticed that the same node is appearing in the error output
>> each time, so I'll try taking that one out of the test PE that the
>> jobs are executing in and seeing if I can tie it to hardware.
>
> This might suggest a hardware issue; let us know what you find.
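
For a sense of scale between the "10" printed in the help text and the "20" passed on the command line, here is a rough sketch. It assumes the usual InfiniBand local ACK timeout formula of 4.096 microseconds * 2^btl_openib_ib_timeout; the quoted output above is cut off before the formula, so that formula is an assumption here rather than something taken from the quote. The helper name is made up for illustration.

```python
# Assumption: local ACK timeout = 4.096 microseconds * 2**btl_openib_ib_timeout
BASE_MICROSECONDS = 4.096


def ack_timeout_seconds(ib_timeout: int) -> float:
    """Hypothetical helper: ACK timeout in seconds for a given exponent value."""
    return BASE_MICROSECONDS * (2 ** ib_timeout) / 1_000_000


# Compare the value shown in the help text (10) with the one actually set (20).
for value in (10, 20):
    print(f"btl_openib_ib_timeout={value}: {ack_timeout_seconds(value):.6f} s")
```

Under that assumption, a setting of 20 corresponds to roughly 4.3 seconds per attempt, against about 4.2 milliseconds for 10, so the stale "10" in the message understates the configured timeout considerably.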