11.09.2014 05:57, Norbert Kiam Maclang wrote:
> Is this something to do with quorum? But I already set
>
> property no-quorum-policy="ignore" \
>     expected-quorum-votes="1"

You'd need to configure fencing at the DRBD resource level:
http://www.drbd.org/users-guide-emb/s-pacemaker-fencing.html#s-pacemaker-fencing-cib
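In practice that section boils down to adding a fencing policy and the two CRM handler scripts to the resource definition. A minimal sketch for your "pg" resource, assuming drbd8-utils installed the scripts under /usr/lib/drbd/ (the usual location on Ubuntu, but verify the paths on your nodes):

# merge into the existing resource pg { ... } in /etc/drbd.d/pg.res
resource pg {
    disk {
        on-io-error detach;
        resync-rate 40M;
        # on loss of the replication link, call the fence-peer handler
        # instead of carrying on blindly
        fencing resource-only;
    }
    handlers {
        # puts a location constraint into the CIB that pins the Master
        # role to the node that still has up-to-date data
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        # removes that constraint again once resync has completed
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
}

Then run "drbdadm adjust pg" on both nodes to apply the change.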
> Thanks in advance,
> Kiam
>
> On Thu, Sep 11, 2014 at 10:09 AM, Norbert Kiam Maclang
> <norbert.kiam.macl...@gmail.com> wrote:
>
> Hi,
>
> Please help me understand what is causing this problem. I have a 2-node
> cluster running on VMs under KVM. Each VM (Ubuntu 14.04) runs on a
> separate hypervisor on a separate machine. Everything worked well during
> testing (I restarted the VMs alternately), but after a day, whenever I
> kill the other node, corosync and pacemaker hang on the surviving node.
> Date and time on the VMs are in sync, I use unicast, tcpdump shows the
> two nodes exchanging traffic, DRBD is healthy, and crm_mon shows a good
> status before I kill the other node. Below are my configurations and the
> versions I use:
>
> corosync 2.3.3-1ubuntu1
> crmsh 1.2.5+hg1034-1ubuntu3
> drbd8-utils 2:8.4.4-1ubuntu1
> libcorosync-common4 2.3.3-1ubuntu1
> libcrmcluster4 1.1.10+git20130802-1ubuntu2
> libcrmcommon3 1.1.10+git20130802-1ubuntu2
> libcrmservice1 1.1.10+git20130802-1ubuntu2
> pacemaker 1.1.10+git20130802-1ubuntu2
> pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2
> postgresql-9.3 9.3.5-0ubuntu0.14.04.1
>
> # /etc/corosync/corosync.conf:
> totem {
>     version: 2
>     token: 3000
>     token_retransmits_before_loss_const: 10
>     join: 60
>     consensus: 3600
>     vsftype: none
>     max_messages: 20
>     clear_node_high_bit: yes
>     secauth: off
>     threads: 0
>     rrp_mode: none
>     interface {
>         member {
>             memberaddr: 10.2.136.56
>         }
>         member {
>             memberaddr: 10.2.136.57
>         }
>         ringnumber: 0
>         bindnetaddr: 10.2.136.0
>         mcastport: 5405
>     }
>     transport: udpu
> }
> amf {
>     mode: disabled
> }
> quorum {
>     provider: corosync_votequorum
>     expected_votes: 1
> }
> aisexec {
>     user: root
>     group: root
> }
> logging {
>     fileline: off
>     to_stderr: yes
>     to_logfile: no
>     to_syslog: yes
>     syslog_facility: daemon
>     debug: off
>     timestamp: on
>     logger_subsys {
>         subsys: AMF
>         debug: off
>         tags: enter|leave|trace1|trace2|trace3|trace4|trace6
>     }
> }
>
> # /etc/corosync/service.d/pcmk:
> service {
>     name: pacemaker
>     ver: 1
> }
>
> # /etc/drbd.d/global_common.conf:
> global {
>     usage-count no;
> }
>
> common {
>     net {
>         protocol C;
>     }
> }
>
> # /etc/drbd.d/pg.res:
> resource pg {
>     device /dev/drbd0;
>     disk /dev/vdb;
>     meta-disk internal;
>     startup {
>         wfc-timeout 15;
>         degr-wfc-timeout 60;
>     }
>     disk {
>         on-io-error detach;
>         resync-rate 40M;
>     }
>     on node01 {
>         address 10.2.136.56:7789;
>     }
>     on node02 {
>         address 10.2.136.57:7789;
>     }
>     net {
>         verify-alg md5;
>         after-sb-0pri discard-zero-changes;
>         after-sb-1pri discard-secondary;
>         after-sb-2pri disconnect;
>     }
> }
>
> # Pacemaker configuration:
> node $id="167938104" node01
> node $id="167938105" node02
> primitive drbd_pg ocf:linbit:drbd \
>     params drbd_resource="pg" \
>     op monitor interval="29s" role="Master" \
>     op monitor interval="31s" role="Slave"
> primitive fs_pg ocf:heartbeat:Filesystem \
>     params device="/dev/drbd0" directory="/var/lib/postgresql/9.3/main" fstype="ext4"
> primitive ip_pg ocf:heartbeat:IPaddr2 \
>     params ip="10.2.136.59" cidr_netmask="24" nic="eth0"
> primitive lsb_pg lsb:postgresql
> group PGServer fs_pg lsb_pg ip_pg
> ms ms_drbd_pg drbd_pg \
>     meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
> colocation pg_on_drbd inf: PGServer ms_drbd_pg:Master
> order pg_after_drbd inf: ms_drbd_pg:promote PGServer:start
> property $id="cib-bootstrap-options" \
>     dc-version="1.1.10-42f2063" \
>     cluster-infrastructure="corosync" \
>     stonith-enabled="false" \
>     no-quorum-policy="ignore"
> rsc_defaults $id="rsc-options" \
>     resource-stickiness="100"
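One caveat once the fencing handler is active: when it fires, it injects an extra location constraint into a configuration like the one quoted above, and crm-unfence-peer.sh is supposed to remove it again after resync. The id and node name are generated by the script, so take this only as the rough shape, not verbatim:

# auto-generated by crm-fence-peer.sh; forbids the Master role
# everywhere except the node with up-to-date data
location drbd-fence-by-handler-pg-ms_drbd_pg ms_drbd_pg \
    rule $role="Master" -inf: #uname ne node02

If a node ever refuses to promote after recovery, look for a leftover drbd-fence-by-handler constraint in "crm configure show" before digging further.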
notify="true" > colocation pg_on_drbd inf: PGServer ms_drbd_pg:Master > order pg_after_drbd inf: ms_drbd_pg:promote PGServer:start > property $id="cib-bootstrap-options" \ > dc-version="1.1.10-42f2063" \ > cluster-infrastructure="corosync" \ > stonith-enabled="false" \ > no-quorum-policy="ignore" > rsc_defaults $id="rsc-options" \ > resource-stickiness="100" > > # Logs on node01 > Sep 10 10:25:33 node01 crmd[1019]: notice: peer_update_callback: > Our peer on the DC is dead > Sep 10 10:25:33 node01 crmd[1019]: notice: do_state_transition: > State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION > cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ] > Sep 10 10:25:33 node01 crmd[1019]: notice: do_state_transition: > State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC > cause=C_FSA_INTERNAL origin=do_election_check ] > Sep 10 10:25:33 node01 corosync[940]: [TOTEM ] A new membership > (10.2.136.56:52 <http://10.2.136.56:52>) was formed. Members left: > 167938105 > Sep 10 10:25:45 node01 kernel: [74452.740024] d-con pg: PingAck did > not arrive in time. > Sep 10 10:25:45 node01 kernel: [74452.740169] d-con pg: peer( > Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( > UpToDate -> DUnknown ) > Sep 10 10:25:45 node01 kernel: [74452.740987] d-con pg: asender > terminated > Sep 10 10:25:45 node01 kernel: [74452.740999] d-con pg: Terminating > drbd_a_pg > Sep 10 10:25:45 node01 kernel: [74452.741235] d-con pg: Connection > closed > Sep 10 10:25:45 node01 kernel: [74452.741259] d-con pg: conn( > NetworkFailure -> Unconnected ) > Sep 10 10:25:45 node01 kernel: [74452.741260] d-con pg: receiver > terminated > Sep 10 10:25:45 node01 kernel: [74452.741261] d-con pg: Restarting > receiver thread > Sep 10 10:25:45 node01 kernel: [74452.741262] d-con pg: receiver > (re)started > Sep 10 10:25:45 node01 kernel: [74452.741269] d-con pg: conn( > Unconnected -> WFConnection ) > Sep 10 10:26:12 node01 lrmd[1016]: warning: child_timeout_callback: > drbd_pg_monitor_31000 process (PID 8445) timed out > Sep 10 10:26:12 node01 lrmd[1016]: warning: operation_finished: > drbd_pg_monitor_31000:8445 - timed out after 20000ms > Sep 10 10:26:12 node01 crmd[1019]: error: process_lrm_event: LRM > operation drbd_pg_monitor_31000 (30) Timed Out (timeout=20000ms) > Sep 10 10:26:32 node01 crmd[1019]: warning: cib_rsc_callback: > Resource update 23 failed: (rc=-62) Timer expired > Sep 10 10:27:03 node01 lrmd[1016]: warning: child_timeout_callback: > drbd_pg_monitor_31000 process (PID 8693) timed out > Sep 10 10:27:03 node01 lrmd[1016]: warning: operation_finished: > drbd_pg_monitor_31000:8693 - timed out after 20000ms > Sep 10 10:27:54 node01 lrmd[1016]: warning: child_timeout_callback: > drbd_pg_monitor_31000 process (PID 8938) timed out > Sep 10 10:27:54 node01 lrmd[1016]: warning: operation_finished: > drbd_pg_monitor_31000:8938 - timed out after 20000ms > Sep 10 10:28:33 node01 crmd[1019]: error: crm_timer_popped: > Integration Timer (I_INTEGRATED) just popped in state S_INTEGRATION! > (180000ms) > Sep 10 10:28:33 node01 crmd[1019]: warning: do_state_transition: > Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED > Sep 10 10:28:33 node01 crmd[1019]: warning: do_state_transition: 1 > cluster nodes failed to respond to the join offer. 
> Sep 10 10:28:33 node01 crmd[1019]: notice: crmd_join_phase_log: join-1: node02=none
> Sep 10 10:28:33 node01 crmd[1019]: notice: crmd_join_phase_log: join-1: node01=welcomed
> Sep 10 10:28:45 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9185) timed out
> Sep 10 10:28:45 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9185 - timed out after 20000ms
> Sep 10 10:29:36 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9432) timed out
> Sep 10 10:29:36 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9432 - timed out after 20000ms
> Sep 10 10:30:27 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9680) timed out
> Sep 10 10:30:27 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9680 - timed out after 20000ms
> Sep 10 10:31:18 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9927) timed out
> Sep 10 10:31:18 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9927 - timed out after 20000ms
> Sep 10 10:32:09 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 10174) timed out
> Sep 10 10:32:09 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:10174 - timed out after 20000ms
>
> # crm_mon on node01 before I kill the other VM:
> Stack: corosync
> Current DC: node02 (167938104) - partition with quorum
> Version: 1.1.10-42f2063
> 2 Nodes configured
> 5 Resources configured
>
> Online: [ node01 node02 ]
>
> Resource Group: PGServer
>     fs_pg  (ocf::heartbeat:Filesystem): Started node02
>     lsb_pg (lsb:postgresql):            Started node02
>     ip_pg  (ocf::heartbeat:IPaddr2):    Started node02
> Master/Slave Set: ms_drbd_pg [drbd_pg]
>     Masters: [ node02 ]
>     Slaves:  [ node01 ]
>
> Thank you,
> Kiam

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org