On Sep 13, 2011, at 6:33 PM, kevin.buck...@ecs.vuw.ac.nz wrote:

> there have been two runs of jobs that invoked the mpirun using these
> OpenMPI parameter setting flags (basically, these mimic what I have
> in the global config file)
>
> -mca btl_openib_ib_timeout 20 -mca btl_openib_ib_min_rnr_timer 25
>
> when both of the jobs failed, the error output was
>
> -----8<----------8<----------8<----------8<----------8<-----
> Two MCA parameters can be used to control Open MPI's behavior with
> respect to the retry count:
>
> * btl_openib_ib_retry_count - The number of times the sender will
>   attempt to retry (defaulted to 7, the maximum value).
> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
>   to 10). The actual timeout value used is calculated as:
> -----8<----------8<----------8<----------8<----------8<-----
>
> Note that the error output is still showing that mysterious "10"
> in there for btl_openib_ib_timeout value.

That text message is hard-coded (and apparently out of date); it does not show the current value. I agree that this is misleading. The error message needs to be improved.

> I have noticed that the same node is appearing in the error output
> each time, so I'll try taking that one out of the test PE that the
> jobs are executing in and seeing if I can tie it to hardware.

This might suggest a hardware issue; let us know what you find.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
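
Aside: the quoted help text is cut off before the formula it refers to. As a rough sketch, assuming the usual InfiniBand local ACK timeout relation of 4.096 microseconds * 2^btl_openib_ib_timeout (an assumption here, since the formula does not appear in the quoted output), the difference between the hard-coded "10" and the "20" passed on the command line works out as follows. The function name is made up for illustration.

```python
# Assumed relation (not shown in the quoted error output):
#   local ACK timeout = 4.096 microseconds * 2 ** btl_openib_ib_timeout
BASE_MICROSECONDS = 4.096


def ack_timeout_seconds(ib_timeout: int) -> float:
    """Return the assumed local ACK timeout, in seconds, for a given exponent."""
    return BASE_MICROSECONDS * (2 ** ib_timeout) / 1_000_000


if __name__ == "__main__":
    # 10 is the value printed in the help text; 20 is the value from the mpirun flags.
    for value in (10, 20):
        print(f"btl_openib_ib_timeout={value}: {ack_timeout_seconds(value):.6f} s")
```

Under that assumption, a value of 10 corresponds to roughly 4 milliseconds per attempt and a value of 20 to roughly 4.3 seconds, so the two settings differ by a factor of 1024.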