On Sep 13, 2011, at 6:33 PM, kevin.buck...@ecs.vuw.ac.nz wrote:

> there have been two runs of jobs that invoked the mpirun using these
> OpenMPI parameter setting flags (basically, these mimic what I have
> in the global config file)
> 
> -mca btl_openib_ib_timeout 20 -mca btl_openib_ib_min_rnr_timer 25
> 
> when both of the jobs failed, the error output was
> 
> -----8<----------8<----------8<----------8<----------8<-----
> Two MCA parameters can be used to control Open MPI's behavior with
> respect to the retry count:
> 
> * btl_openib_ib_retry_count - The number of times the sender will
>  attempt to retry (defaulted to 7, the maximum value).
> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
>  to 10).  The actual timeout value used is calculated as:
> -----8<----------8<----------8<----------8<----------8<-----
> 
> Note that the error output is still showing that mysterious "10"
> in there for the btl_openib_ib_timeout value.

That message text is hard-coded (and apparently out of date); it does not show
the value actually in effect for your job.

I agree that that is misleading.  This error message needs to be improved.
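
For reference, here is a rough sketch of the arithmetic that the truncated
message is leading up to. It assumes the usual InfiniBand local ACK timeout
convention of 4.096 microseconds * 2^btl_openib_ib_timeout; the helper name
is hypothetical and is not part of Open MPI.

```python
# Sketch only: the 4.096 microsecond base and the power-of-two scaling are the
# InfiniBand local ACK timeout convention, assumed here to be what
# btl_openib_ib_timeout feeds into.
BASE_SECONDS = 4.096e-6


def ack_timeout_seconds(ib_timeout: int) -> float:
    """Return the local ACK timeout in seconds (hypothetical helper)."""
    return BASE_SECONDS * (2 ** ib_timeout)


if __name__ == "__main__":
    # Compare the value quoted in the message text with the one set on the command line.
    for value in (10, 20):
        print(f"btl_openib_ib_timeout={value}: {ack_timeout_seconds(value):.6f} s")
```

Under that assumption, 10 works out to roughly 4.2 milliseconds per attempt
and 20 to roughly 4.3 seconds per attempt.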

> I have noticed that the same node is appearing in the error output
> each time, so I'll try taking that one out of the test PE that the
> jobs are executing in and seeing if I can tie it to hardware.

This might suggest a hardware issue; let us know what you find.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

