On Mon, Feb 11, 2013 at 12:41 PM, Carlos Xavier <cbas...@connection.com.br> wrote:
> Hi Andrew,
>
> thank you very much for your hints.
>
>> > Hi.
>> >
>> > We are running two clusters composed of two machines each. We are
>> > using DRBD + OCFS2 to provide the common filesystem.
>
> [snip]
>
>> >
>> > The clusters run nicely under normal load, except when backing up
>> > files or optimizing the databases. At those times we get a huge surge
>> > of data coming through mysqldump to the backup resource, or from the
>> > resource mounted on /export.
>> > Sometimes when performing the backup or optimizing the database (done
>> > just on the mysql cluster), Pacemaker declares a node dead (but it's
>> > not)
>>
>> Well, you know that, but it doesn't :)
>> It just knows it can't talk to its peer anymore - which it has to
>> treat as a failure.
>>
>> > and starts the recovery process. When that happens we end up with both
>> > machines getting restarted, and most of the time with a crashed
>> > database :-(
>> >
>> > As you can see below, the problem happens just about 30 seconds after
>> > the dump starts on diana.
>> > ----------------------------------------------------------------
>
> [snip]
>
>> > Feb 6 04:27:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
>> > Feb 6 04:28:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
>> > Feb 6 04:29:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
>> > Feb 6 04:30:01 diana /USR/SBIN/CRON[1257]: (root) CMD (/root/scripts/bkp_database_diario.sh)
>> > Feb 6 04:30:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
>> > Feb 6 04:31:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
>> > Feb 6 04:31:42 diana lrmd: [2919]: WARN: ip_intranet:0:monitor process (PID 1902) timed out (try 1). Killing with signal SIGTERM (15).
>>
>> I'd increase the timeout here. Or put Pacemaker into maintenance mode
>> (where it will not act on failures) while you do the backups - but
>> that's more dangerous.
>>
>> > Feb 6 04:31:47 diana corosync[2902]: [CLM ] CLM CONFIGURATION CHANGE
>> > Feb 6 04:31:47 diana corosync[2902]: [CLM ] New Configuration:
>> > Feb 6 04:31:47 diana corosync[2902]: [CLM ] #011r(0) ip(10.10.1.2) r(1) ip(10.10.10.9)
>> > Feb 6 04:31:47 diana corosync[2902]: [CLM ] Members Left:
>> > Feb 6 04:31:47 diana corosync[2902]: [CLM ] #011r(0) ip(10.10.1.1) r(1) ip(10.10.10.8)
>> > Feb 6 04:31:47 diana corosync[2902]: [CLM ] Members Joined:
>> >
>>
>> This appears to be the (almost) root of your problem.
>> The load is starving corosync of CPU (or possibly network bandwidth)
>> and it can no longer talk to its peer.
>> Corosync then informs Pacemaker, which initiates recovery.
>>
>> I'd start by tuning some of your timeout values in corosync.conf
>>
>
> It should be the CPU, because I can see it going to 100% usage on the
> cacti graph.
> Also, we have two rings for corosync: one affected by the data flow at
> backup time and another with free bandwidth.
>
> This is the totem section of my configuration:
>
> totem {
>         version: 2
>         token: 5000
>         token_retransmits_before_loss_const: 10
>         join: 60
>         consensus: 6000
>         vsftype: none
>         max_messages: 20
>         clear_node_high_bit: yes
>         secauth: off
>         threads: 0
>         rrp_mode: active
>         interface {
>                 ringnumber: 0
>                 bindnetaddr: 10.10.1.0
>                 mcastaddr: 226.94.1.1
>                 mcastport: 5406
>                 ttl: 1
>         }
>         interface {
>                 ringnumber: 1
>                 bindnetaddr: 10.10.10.0
>                 mcastaddr: 226.94.1.1
>                 mcastport: 5406
>                 ttl: 1
>         }
> }
>
> Can you kindly point out what timer/counter I should play with?
I would start by making these higher - perhaps double them and see what
effect it has:

    token: 5000
    token_retransmits_before_loss_const: 10

> What are reasonable values for them? I got scared by this warning: "It
> is not recommended to alter this value without guidance from the
> corosync community."

> Is there any benefit in changing the rrp_mode from active to passive?

Not something I've played with, sorry.

> Should it be done on both hosts?

It should be the same on both, I would imagine.

>> > ----------------------------------------------------------------
>> >
>> > Feb 6 04:30:32 apolo lrmd: [2855]: info: RA output: (httpd:0:monitor:stderr) redirecting to systemctl
>> > Feb 6 04:31:32 apolo lrmd: [2855]: info: RA output: (httpd:0:monitor:stderr) redirecting to systemctl
>> > Feb 6 04:31:41 apolo corosync[2848]: [TOTEM ] A processor failed, forming new configuration.
>> > Feb 6 04:31:47 apolo corosync[2848]: [CLM ] CLM CONFIGURATION CHANGE
>> > Feb 6 04:31:47 apolo corosync[2848]: [CLM ] New Configuration:
>> > Feb 6 04:31:47 apolo corosync[2848]: [CLM ] #011r(0) ip(10.10.1.1) r(1) ip(10.10.10.8)
>> > Feb 6 04:31:47 apolo corosync[2848]: [CLM ] Members Left:
>> > Feb 6 04:31:47 apolo corosync[2848]: [CLM ] #011r(0) ip(10.10.1.2) r(1) ip(10.10.10.9)
>> > Feb 6 04:31:47 apolo corosync[2848]: [CLM ] Members Joined:
>> > Feb 6 04:31:47 apolo corosync[2848]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 304: memb=1, new=0, lost=1
>
> [snip]
>
>> >
>> > After lots of logging, apolo asks diana to reboot, and some time
>> > after that it gets rebooted too.
>> > We had an old cluster with heartbeat and DRBD that used to behave
>> > this way, but now it looks like Pacemaker is the culprit.
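For reference, the suggestion above - doubling token and
token_retransmits_before_loss_const - would look roughly like the
fragment below. These values are illustrative starting points, not
recommendations. Note that the consensus value in your config (6000) is
1.2 x the token value, which matches corosync's default ratio, so it
should be scaled up alongside token:

```
totem {
        version: 2
        # Doubled from the original 5000/10 as an experiment; tune
        # further based on how the cluster behaves under backup load.
        token: 10000
        token_retransmits_before_loss_const: 20
        # Kept at 1.2 * token, the same ratio as the original config.
        consensus: 12000
        # ... remaining options unchanged ...
}
```

Alternatively, the maintenance-mode route can be scripted around the
backup cron job, e.g. `crm configure property maintenance-mode=true`
before the dump and `maintenance-mode=false` afterwards (crmsh syntax;
check the exact invocation for your tool version) - at the cost of the
cluster ignoring real failures during that window.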
>> >
>> > Here is my Pacemaker and DRBD configuration:
>> > http://www2.connection.com.br/cbastos/pacemaker/crm_config
>> > http://www2.connection.com.br/cbastos/pacemaker/drbd_conf/global_common.setup
>> > http://www2.connection.com.br/cbastos/pacemaker/drbd_conf/backup.res
>> > http://www2.connection.com.br/cbastos/pacemaker/drbd_conf/export.res
>> >
>> > And more detailed logs:
>> > http://www2.connection.com.br/cbastos/pacemaker/reboot_apolo
>> > http://www2.connection.com.br/cbastos/pacemaker/reboot_diana
>> >
>
> Best regards,
> Carlos.

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org