Re: [OMPI users] Fault Tolerance & Behavior

2006-10-31 Thread Troy Telford
On Tue, 31 Oct 2006 08:43:10 -0700, Galen M. Shipman wrote: > Okay, so these are percentages, not modulus; the formula makes some sense now. So the timeout is between 4.9 and 10.3 ms; you had better plug the cable in/out very quickly. The Flash could do it. -- Troy Telford

Re: [OMPI users] Fault Tolerance & Behavior

2006-10-31 Thread Galen M. Shipman
Galen M. Shipman wrote: Gleb Natapov wrote: On Mon, Oct 30, 2006 at 11:45:53AM -0700, Troy Telford wrote: On Sun, 29 Oct 2006 01:34:06 -0700, Gleb Natapov wrote: If you use OB1 PML (default one) it will never recover from link down error no matter how many other tran

Re: [OMPI users] Fault Tolerance & Behavior

2006-10-31 Thread Galen M. Shipman
Gleb Natapov wrote: On Mon, Oct 30, 2006 at 11:45:53AM -0700, Troy Telford wrote: On Sun, 29 Oct 2006 01:34:06 -0700, Gleb Natapov wrote: If you use OB1 PML (default one) it will never recover from link down error no matter how many other transports you have. The reason is that OB1

Re: [OMPI users] Fault Tolerance & Behavior

2006-10-31 Thread Gleb Natapov
On Mon, Oct 30, 2006 at 11:45:53AM -0700, Troy Telford wrote: > On Sun, 29 Oct 2006 01:34:06 -0700, Gleb Natapov > wrote: > > > If you use OB1 PML (default one) it will never recover from link down > > error no matter how many other transports you have. The reason is that > > OB1 never tracks w

Re: [OMPI users] Fault Tolerance & Behavior

2006-10-30 Thread Troy Telford
On Sun, 29 Oct 2006 01:34:06 -0700, Gleb Natapov wrote: If you use OB1 PML (default one) it will never recover from link down error no matter how many other transports you have. The reason is that OB1 never tracks what happens with buffers submitted to BTL. So if BTL can't, for any reason, tr

Re: [OMPI users] Fault Tolerance & Behavior

2006-10-29 Thread Gleb Natapov
On Thu, Oct 26, 2006 at 05:39:13PM -0600, Troy Telford wrote: > I'm also confident that both TCP & Myrinet would throw an error when they > time out; it's just that I haven't felt the need to verify it. (And with > some-odd 20 minutes for Myrinet, it takes a bit of attention span. The > las

Re: [OMPI users] Fault Tolerance & Behavior

2006-10-26 Thread Troy Telford
On Thu, 26 Oct 2006 15:11:46 -0600, George Bosilca wrote: > The Open MPI behavior is the same independently of the network used for the job. At least the behavior dictated by our internal message passing layer. Which is one of the things I like about Open MPI. There is nothing (that has a r

Re: [OMPI users] Fault Tolerance & Behavior

2006-10-26 Thread George Bosilca
Moreover ... you have to have admin rights in order to modify these parameters. If that's the case, there is a trick for MX too. One can recompile it with a different timeout (recompilation is required as far as I remember). Grep for timeout in the MX sources and you will find out how to

Re: [OMPI users] Fault Tolerance & Behavior

2006-10-26 Thread Durga Choudhury
As an alternate suggestion (although George's is better, since this will affect your entire network connectivity), you could override the default TCP timeout values with the "sysctl -w" command. The following three OIDs affect TCP timeout behavior under Linux: net.ipv4.tcp_keepalive_intvl = 75 <-

Re: [OMPI users] Fault Tolerance & Behavior

2006-10-26 Thread George Bosilca
The Open MPI behavior is the same independently of the network used for the job. At least the behavior dictated by our internal message passing layer. But for this to happen, we should get a warning from the network that something is wrong (such as a timeout). In the case of TCP (and Myrinet)