On 11/21/2013 06:26 AM, Gianluca Cecchi wrote:
On Thu, Nov 21, 2013 at 9:09 AM, Lars Marowsky-Bree wrote:
On 2013-11-20T16:58:01, Gianluca Cecchi <gianluca.cec...@gmail.com> wrote:

Based on the docs I thought that the timeout should be

token x token_retransmits_before_loss_const
No, the comments in corosync.conf.example and in man corosync.conf
should be pretty clear, I hope. Can you recommend which phrasing we
should improve?
I have not understood the exact relationship between token and
token_retransmits_before_loss_const -
when one comes into play and when the other does...
So perhaps the second one could be given more detail,
or some web links.

The token retransmit timeout is a timer that is started each time a token is transmitted. The token timeout is the maximum timer that exists - the total is not token * token_retransmits_before_loss_const.

The token_retransmits_before_loss_const says "please transmit a replacement token x times within the token period". Since the token is sent over UDP, it could be lost in network overflow situations or other scenarios.

Using a real-world example:

token: 10000
token_retransmits_before_loss_const: 10

The token will be retransmitted roughly every 1000 msec and will be declared lost after 10000 msec.
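
In corosync.conf terms that example corresponds to a totem section roughly like this (the retransmit interval is derived automatically from these two values, so token_retransmit itself is normally left alone):

    totem {
        version: 2
        # maximum time in msec before the token is declared lost
        token: 10000
        # how many times a replacement token is sent within that window
        token_retransmits_before_loss_const: 10
    }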

Regards
-steve

So my current test config is:
   # diff corosync.conf corosync.conf.pre181113
24,25c24
< #token: 5000
< token: 120000
A 120s node timeout? That is really, really long. Why is the backup tool
interfering with the scheduling of high priority processes so much? That
sounds like the real bug.
In fact I inherited the analysis from a previous production cluster, and I'm
setting up a test environment to demonstrate that one realistic
outcome could well be that a cluster is not the right solution
here, because the underlying infrastructure is not stable enough.
I don't have much visibility into the VMware and SAN details,
but I'm pushing to get it.
I sometimes saw disk latencies of 8000 milliseconds... ;-(
So another possible outcome could be to build a more reliable infrastructure
before going with a cluster.
I'm deliberately setting high values to see what happens, then lowering
them step by step.
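To double-check which values the running cluster actually picked up, dumping the runtime keys should do, something like this (corosync-cmapctl is the corosync 2.x tool; on 1.x the equivalent is corosync-objctl and the key names may differ slightly):

    # list the totem timing values corosync is currently using
    corosync-cmapctl | grep totem.token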
BTW: I remember a past thread where others had problems
with NetBackup (or similar backup software) using snapshots, and
setting higher values solved the sporadic problems (possibly 20000 for
token and 10 for retransmit, but I couldn't find the thread ...)


Any comments?
Any different strategies successfully used in similar environments,
where high latencies show up at snapshot deletion when the
disk consolidation phase is executed?
A setup where a VM apparently can freeze for almost 120s is not suitable
for HA.

I see from previous logs that sometimes DRBD disconnects and
reconnects only after 30-40 seconds with the default timeouts...
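
For reference, these are the drbd.conf net-section timeouts that come into play here; the values below are just the usual defaults as placeholders, not what is configured in production:

    resource r0 {           # resource name is only a placeholder
      net {
        timeout      60;    # 6.0 s - the unit is tenths of a second
        ping-int     10;    # seconds between keep-alive pings to the peer
        ping-timeout  5;    # 0.5 s - also in tenths of a second
      }
    }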

Thanks for your inputs.

Gianluca

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

