On Sun, 26 Oct 2014 10:51:13 +0200 Andrew <ni...@seti.kr.ua> wrote:
> 26.10.2014 08:32, Andrei Borzenkov wrote:
> > On Sat, 25 Oct 2014 23:34:54 +0300, Andrew <ni...@seti.kr.ua> wrote:
> >
> >> 25.10.2014 22:34, Digimer wrote:
> >>> On 25/10/14 03:32 PM, Andrew wrote:
> >>>> Hi all.
> >>>>
> >>>> I use Percona as an RA on the cluster (nothing mission-critical,
> >>>> currently - just zabbix data); today after restarting the MySQL
> >>>> resource (crm resource restart p_mysql) I got a split-brain state -
> >>>> MySQL for some reason started first on the ex-slave node, and the
> >>>> ex-master started later (possibly I've set too small a shutdown
> >>>> timeout - only 120s, but I'm not sure).

Your logs do not show a resource restart - they show a pacemaker restart
on node2.

> >>>> After restarting the resource on both nodes it seemed like MySQL
> >>>> replication was OK - but then after ~50 min it fell into split brain
> >>>> again for an unknown reason (no resource restart was noticed).
> >>>>
> >>>> In 'show replication status' there is an error in a table caused by
> >>>> a unique index duplicate.
> >>>>
> >>>> So I have questions:
> >>>> 1) Which thing causes the split brain, and how to avoid it in future?
> >>> Cause:
> >>>
> >>> Logs?
> >> Oct 25 13:54:13 node2 crmd[29248]: notice: do_state_transition: State
> >> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> >> cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> >> Oct 25 13:54:13 node2 pengine[29247]: notice: unpack_config: On loss
> >> of CCM Quorum: Ignore
> >> Oct 25 13:54:13 node2 pengine[29247]: notice: unpack_rsc_op: Operation
> >> monitor found resource p_pgsql:0 active in master mode on node1.cluster
> >> Oct 25 13:54:13 node2 pengine[29247]: notice: unpack_rsc_op: Operation
> >> monitor found resource p_mysql:1 active in master mode on node2.cluster
> > That seems too late. The real cause is that the resource was reported
> > as being in master state on both nodes, and this happened earlier.
> These are different resources (pgsql and mysql).
> >>> Prevent:
> >>>
> >>> Fencing (aka stonith). This is why fencing is required.
> >> No node failure. Just the daemon was restarted.
> >>
> > "Split brain" == loss of communication. It does not matter whether
> > communication was lost because the node failed or because the daemon
> > was not running. There is no way for the surviving node to know *why*
> > communication was lost.
> >
> So how will stonith help in this case? The daemon will be restarted
> after its death if that happens during a restart, and stonith will see
> a live daemon...
>
> So what is the easiest split-brain solution? Just stop the daemons and
> copy all MySQL data from the good node to the bad one?

There is no split brain visible in your log. Pacemaker on node2 was
restarted, cleanly as far as I can tell, and reintegrated back into the
cluster. Maybe node1 "lost" node2, but that needs logs from node1. You
probably misuse "split brain" in this case. Split brain means the nodes
lost communication with each other, so each node is unaware of what state
the resources on the other node are in. Here "nodes" means
corosync/pacemaker, not individual resources.
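
If you want to confirm whether the cluster itself ever lost membership
(i.e. a real split brain at the corosync/pacemaker level), a quick check
on each node is something like the following - a rough sketch, the exact
tools depend on your corosync version:

  # one-shot cluster view as pacemaker sees it (node list, DC, resources)
  crm_mon -1

  # corosync ring status on this node; "no faults" means the ring is healthy
  corosync-cfgtool -s

If both nodes show each other as online the whole time, what you saw was
a resource-level problem (duplicate rows breaking replication), not a
cluster split brain.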
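
As for the easiest way out of the broken replication: yes, re-seeding the
bad node from the good one is the usual approach. A minimal sketch,
assuming p_mysql is the resource as above, node1.cluster is the good node
and a replication user 'repl' already exists (names, paths and binlog
coordinates below are placeholders; the Percona RA may also have its own
documented resync procedure):

  # take the resource out of pacemaker's hands while you copy data
  crm resource unmanage p_mysql

  # on the good node: consistent dump that records the binlog coordinates
  mysqldump --all-databases --single-transaction --master-data=2 \
      -u root -p > /tmp/reseed.sql
  scp /tmp/reseed.sql node2.cluster:/tmp/

  # on the bad node: stop mysqld, restore the dump, re-point replication
  # (take MASTER_LOG_FILE/MASTER_LOG_POS from the header of reseed.sql)
  mysql -u root -p < /tmp/reseed.sql
  mysql -u root -p -e "CHANGE MASTER TO MASTER_HOST='node1.cluster',
      MASTER_USER='repl', MASTER_PASSWORD='***',
      MASTER_LOG_FILE='mysql-bin.000123', MASTER_LOG_POS=107;
      START SLAVE;"

  # check Slave_IO_Running/Slave_SQL_Running and Last_SQL_Error
  mysql -u root -p -e "SHOW SLAVE STATUS\G"

  # hand the resource back to pacemaker and clear old failures
  crm resource manage p_mysql
  crm resource cleanup p_mysql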