On Thu, Nov 21, 2013 at 9:09 AM, Lars Marowsky-Bree wrote:
> On 2013-11-20T16:58:01, Gianluca Cecchi <gianluca.cec...@gmail.com> wrote:
>
>> Based on docs I thought that the timeout should be
>>
>> token x token_retransmits_before_loss_const
>
> No, the comments in the corosync.conf.example and man corosync.conf
> should be pretty clear, I hope. Can you recommend which phrasing we
> should improve?
I have not understood the exact relationship between token and
token_retransmits_before_loss_const: when one comes into play and when the
other one does. So perhaps the second one could be given more detail there,
or some web links.

>> So my current test config is:
>> # diff corosync.conf corosync.conf.pre181113
>> 24,25c24
>> < #token: 5000
>> < token: 120000
>
> A 120s node timeout? That is really, really long. Why is the backup tool
> interfering with the scheduling of high priority processes so much? That
> sounds like the real bug.

In fact I inherited the analysis of a previous production cluster, and I'm
setting up a test environment to demonstrate that one realistic outcome
could well be that a cluster is not the right solution here, because the
underlying infrastructure is not stable enough.
I'm not given much visibility into the VMware and SAN details, but I'm
pushing to get it. I have sometimes seen disk latencies of around
8000 milliseconds... ;-(
So another possible outcome could be to make the infrastructure more
reliable before going ahead with a cluster.
I'm deliberately setting very high values to see what happens, and will then
lower them step by step.
BTW: I remember a past thread where others had problems with NetBackup (or
similar backup software) using snapshots, and where setting higher values
solved the sporadic problems (possibly 20000 for token and 10 for
retransmit, but I couldn't find the thread...).

>> Any comment?
>> Any different strategies successfully used in similar environments where
>> high latencies occur during the disk consolidation phase of snapshot
>> deletion?
>
> A setup where a VM apparently can freeze for almost 120s is not suitable
> for HA.

I see from previous logs that sometimes DRBD disconnects and only reconnects
after 30-40 seconds with the default timeouts...

Thanks for your inputs.
Gianluca
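
P.S. Just to check whether I now have the model right (this is only my
reading of man corosync.conf and the defaults, so please correct me if it
is wrong): the node-loss timeout is the token value itself, and the
retransmits happen inside that window; the retransmit interval is derived
from the two values (roughly token / (token_retransmits_before_loss_const
+ 0.2), if I read the source correctly). So for the current test the totem
section looks more or less like the sketch below. The consensus line and
the derived-interval formula are my assumptions from the docs, not values
I have verified on this cluster:

    totem {
            version: 2
            # total time in ms without a token before a membership change
            # is declared; deliberately very high, for this test only
            token: 120000
            # number of retransmit attempts inside the token window (left
            # at the default of 4 here); the retransmit interval is derived
            # from token and this value, so token_retransmit is not set
            # explicitly
            token_retransmits_before_loss_const: 4
            # my understanding is that consensus should be at least
            # 1.2 * token, hence this value (assumption, not verified)
            consensus: 144000
    }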