I have recently seen some OpenIB timeout errors, and the following is reported alongside them:

* btl_openib_ib_retry_count - The number of times the sender will
  attempt to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
  to 10). The actual timeout value used is calculated as:

I'd like to confirm whether, when those messages say "defaulted to", they are telling me the value actually in use on the node in question, and not just what the built-in default is.

The reason for asking is that I believe I am setting btl_openib_ib_timeout to 20, globally, as suggested in parts of the OpenMPI docs, but those messages, if they do report what is actually in effect, might be telling me otherwise.

In case it is relevant, the OpenMPI in question is the bog-standard RHEL5 package, version 1.4.4.

--
Kevin M. Buckley                      Room:  CO327
School of Engineering and             Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand
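
P.S. To show why the difference between 10 and 20 matters, here is a rough sketch of the per-attempt timeout for each value. It assumes that the formula cut off in the quoted message is the usual InfiniBand local ACK timeout, 4.096 microseconds * 2^btl_openib_ib_timeout. That formula is my assumption, not something shown in the output above, and the function name is just for illustration.

```python
# Assumption: local ACK timeout = 4.096 microseconds * 2**btl_openib_ib_timeout.
# The quoted error text is truncated before the formula, so this is not
# taken from the message itself.
BASE_MICROSECONDS = 4.096


def ack_timeout_seconds(ib_timeout: int) -> float:
    """Return the assumed per-attempt ACK timeout in seconds."""
    return BASE_MICROSECONDS * (2 ** ib_timeout) / 1_000_000


if __name__ == "__main__":
    # 10 is the value the message reports; 20 is the value I believe I set.
    for value in (10, 20):
        seconds = ack_timeout_seconds(value)
        print(f"btl_openib_ib_timeout={value}: {seconds:.6f} s per attempt")
```

Under that assumption, 10 gives roughly 4.2 milliseconds per attempt and 20 gives roughly 4.3 seconds, so which value is really in effect would make a large difference to when these errors fire.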