Hi,

After adding resource-level fencing on DRBD, I still end up with timeouts on the DRBD monitor operations. Is there a recommended setting for this? I followed what is written in the DRBD documentation:
http://www.drbd.org/users-guide-emb/s-pacemaker-crm-drbd-backed-service.html

Another thing I can't understand is why failover works during the initial tests, even when I reboot the VMs several times, but after I let the cluster soak for a while (say 8 hours or more) and repeat the tests, it no longer fails over and ends up in split brain. I confirmed that everything was healthy before performing the reboot: disk health and network are good, DRBD is in sync, and the time between the servers is in sync.
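The only concrete change I can think of so far is to give the drbd_pg operations explicit timeouts larger than the 20s that keeps being hit (see the logs below). Something like this is what I have in mind; the timeout values are only my own guesses, not taken from any documentation:

  # NOTE: the timeout values below are my guesses, not recommended values
  primitive drbd_pg ocf:linbit:drbd \
          params drbd_resource="pg" \
          op monitor interval="29s" role="Master" timeout="60s" \
          op monitor interval="31s" role="Slave" timeout="60s" \
          op start interval="0" timeout="240s" \
          op stop interval="0" timeout="100s"

If raising the timeouts would only hide whatever makes the monitor hang after the node has been up for a while, please tell me what I should be looking at instead.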
# Logs:
node01 lrmd[1036]:  warning: child_timeout_callback: drbd_pg_monitor_29000 process (PID 27744) timed out
node01 lrmd[1036]:  warning: operation_finished: drbd_pg_monitor_29000:27744 - timed out after 20000ms
node01 crmd[1039]:    error: process_lrm_event: LRM operation drbd_pg_monitor_29000 (69) Timed Out (timeout=20000ms)
node01 crmd[1039]:  warning: update_failcount: Updating failcount for drbd_pg on tyo1mqdb01p after failed monitor: rc=1 (update=value++, time=1410486352)

Thanks,
Kiam

On Thu, Sep 11, 2014 at 6:58 PM, Norbert Kiam Maclang <norbert.kiam.macl...@gmail.com> wrote:
> Thank you Vladislav.
>
> I have configured resource-level fencing on drbd and removed wfc-timeout
> and degr-wfc-timeout (is this required?). My drbd configuration is now:
>
> resource pg {
>   device /dev/drbd0;
>   disk /dev/vdb;
>   meta-disk internal;
>   disk {
>     fencing resource-only;
>     on-io-error detach;
>     resync-rate 40M;
>   }
>   handlers {
>     fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>     after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
>     split-brain "/usr/lib/drbd/notify-split-brain.sh nkbm";
>   }
>   on node01 {
>     address 10.2.136.52:7789;
>   }
>   on node02 {
>     address 10.2.136.55:7789;
>   }
>   net {
>     verify-alg md5;
>     after-sb-0pri discard-zero-changes;
>     after-sb-1pri discard-secondary;
>     after-sb-2pri disconnect;
>   }
> }
>
> Failover works on my initial test (restarting both nodes alternately -
> this always works). I will wait a couple of hours and then run the failover
> test again (which always failed on my previous setup).
>
> Thank you!
> Kiam
>
> On Thu, Sep 11, 2014 at 2:14 PM, Vladislav Bogdanov <bub...@hoster-ok.com> wrote:
>> 11.09.2014 05:57, Norbert Kiam Maclang wrote:
>> > Is this something to do with quorum? But I already set
>>
>> You'd need to configure fencing at the drbd resource level.
>>
>> http://www.drbd.org/users-guide-emb/s-pacemaker-fencing.html#s-pacemaker-fencing-cib
>>
>> > property no-quorum-policy="ignore" \
>> >   expected-quorum-votes="1"
>> >
>> > Thanks in advance,
>> > Kiam
>> >
>> > On Thu, Sep 11, 2014 at 10:09 AM, Norbert Kiam Maclang
>> > <norbert.kiam.macl...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > Please help me understand what is causing the problem. I have a 2-node
>> > cluster running on VMs using KVM. Each VM (I am using Ubuntu 14.04)
>> > runs on a separate hypervisor on a separate machine. Everything worked
>> > well during testing (I restarted the VMs alternately), but after a day,
>> > when I kill the other node, corosync and pacemaker always end up hanging
>> > on the surviving node. Date and time on the VMs are in sync, I use
>> > unicast, tcpdump shows both nodes exchanging traffic, and I confirmed
>> > that DRBD is healthy and crm_mon shows good status before I kill the
>> > other node.
>> > Below are my configurations and the versions I used:
>> >
>> > corosync 2.3.3-1ubuntu1
>> > crmsh 1.2.5+hg1034-1ubuntu3
>> > drbd8-utils 2:8.4.4-1ubuntu1
>> > libcorosync-common4 2.3.3-1ubuntu1
>> > libcrmcluster4 1.1.10+git20130802-1ubuntu2
>> > libcrmcommon3 1.1.10+git20130802-1ubuntu2
>> > libcrmservice1 1.1.10+git20130802-1ubuntu2
>> > pacemaker 1.1.10+git20130802-1ubuntu2
>> > pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2
>> > postgresql-9.3 9.3.5-0ubuntu0.14.04.1
>> >
>> > # /etc/corosync/corosync.conf:
>> > totem {
>> >   version: 2
>> >   token: 3000
>> >   token_retransmits_before_loss_const: 10
>> >   join: 60
>> >   consensus: 3600
>> >   vsftype: none
>> >   max_messages: 20
>> >   clear_node_high_bit: yes
>> >   secauth: off
>> >   threads: 0
>> >   rrp_mode: none
>> >   interface {
>> >     member {
>> >       memberaddr: 10.2.136.56
>> >     }
>> >     member {
>> >       memberaddr: 10.2.136.57
>> >     }
>> >     ringnumber: 0
>> >     bindnetaddr: 10.2.136.0
>> >     mcastport: 5405
>> >   }
>> >   transport: udpu
>> > }
>> > amf {
>> >   mode: disabled
>> > }
>> > quorum {
>> >   provider: corosync_votequorum
>> >   expected_votes: 1
>> > }
>> > aisexec {
>> >   user: root
>> >   group: root
>> > }
>> > logging {
>> >   fileline: off
>> >   to_stderr: yes
>> >   to_logfile: no
>> >   to_syslog: yes
>> >   syslog_facility: daemon
>> >   debug: off
>> >   timestamp: on
>> >   logger_subsys {
>> >     subsys: AMF
>> >     debug: off
>> >     tags: enter|leave|trace1|trace2|trace3|trace4|trace6
>> >   }
>> > }
>> >
>> > # /etc/corosync/service.d/pcmk:
>> > service {
>> >   name: pacemaker
>> >   ver: 1
>> > }
>> >
>> > # /etc/drbd.d/global_common.conf:
>> > global {
>> >   usage-count no;
>> > }
>> >
>> > common {
>> >   net {
>> >     protocol C;
>> >   }
>> > }
>> >
>> > # /etc/drbd.d/pg.res:
>> > resource pg {
>> >   device /dev/drbd0;
>> >   disk /dev/vdb;
>> >   meta-disk internal;
>> >   startup {
>> >     wfc-timeout 15;
>> >     degr-wfc-timeout 60;
>> >   }
>> >   disk {
>> >     on-io-error detach;
>> >     resync-rate 40M;
>> >   }
>> >   on node01 {
>> >     address 10.2.136.56:7789;
>> >   }
>> >   on node02 {
>> >     address 10.2.136.57:7789;
>> >   }
>> >   net {
>> >     verify-alg md5;
>> >     after-sb-0pri discard-zero-changes;
>> >     after-sb-1pri discard-secondary;
>> >     after-sb-2pri disconnect;
>> >   }
>> > }
>> >
>> > # Pacemaker configuration:
>> > node $id="167938104" node01
>> > node $id="167938105" node02
>> > primitive drbd_pg ocf:linbit:drbd \
>> >   params drbd_resource="pg" \
>> >   op monitor interval="29s" role="Master" \
>> >   op monitor interval="31s" role="Slave"
>> > primitive fs_pg ocf:heartbeat:Filesystem \
>> >   params device="/dev/drbd0" directory="/var/lib/postgresql/9.3/main" fstype="ext4"
>> > primitive ip_pg ocf:heartbeat:IPaddr2 \
>> >   params ip="10.2.136.59" cidr_netmask="24" nic="eth0"
>> > primitive lsb_pg lsb:postgresql
>> > group PGServer fs_pg lsb_pg ip_pg
>> > ms ms_drbd_pg drbd_pg \
>> >   meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
>> > colocation pg_on_drbd inf: PGServer ms_drbd_pg:Master
>> > order pg_after_drbd inf: ms_drbd_pg:promote PGServer:start
>> > property $id="cib-bootstrap-options" \
>> >   dc-version="1.1.10-42f2063" \
>> >   cluster-infrastructure="corosync" \
>> >   stonith-enabled="false" \
>> >   no-quorum-policy="ignore"
>> > rsc_defaults $id="rsc-options" \
>> >   resource-stickiness="100"
>> >
>> > # Logs on node01
>> > Sep 10 10:25:33 node01 crmd[1019]:   notice: peer_update_callback: Our peer on the DC is dead
>> > Sep 10 10:25:33 node01 crmd[1019]:   notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
>> > Sep 10 10:25:33 node01 crmd[1019]:   notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
>> > Sep 10 10:25:33 node01 corosync[940]:   [TOTEM ] A new membership (10.2.136.56:52) was formed. Members left: 167938105
>> > Sep 10 10:25:45 node01 kernel: [74452.740024] d-con pg: PingAck did not arrive in time.
>> > Sep 10 10:25:45 node01 kernel: [74452.740169] d-con pg: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
>> > Sep 10 10:25:45 node01 kernel: [74452.740987] d-con pg: asender terminated
>> > Sep 10 10:25:45 node01 kernel: [74452.740999] d-con pg: Terminating drbd_a_pg
>> > Sep 10 10:25:45 node01 kernel: [74452.741235] d-con pg: Connection closed
>> > Sep 10 10:25:45 node01 kernel: [74452.741259] d-con pg: conn( NetworkFailure -> Unconnected )
>> > Sep 10 10:25:45 node01 kernel: [74452.741260] d-con pg: receiver terminated
>> > Sep 10 10:25:45 node01 kernel: [74452.741261] d-con pg: Restarting receiver thread
>> > Sep 10 10:25:45 node01 kernel: [74452.741262] d-con pg: receiver (re)started
>> > Sep 10 10:25:45 node01 kernel: [74452.741269] d-con pg: conn( Unconnected -> WFConnection )
>> > Sep 10 10:26:12 node01 lrmd[1016]:  warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8445) timed out
>> > Sep 10 10:26:12 node01 lrmd[1016]:  warning: operation_finished: drbd_pg_monitor_31000:8445 - timed out after 20000ms
>> > Sep 10 10:26:12 node01 crmd[1019]:    error: process_lrm_event: LRM operation drbd_pg_monitor_31000 (30) Timed Out (timeout=20000ms)
>> > Sep 10 10:26:32 node01 crmd[1019]:  warning: cib_rsc_callback: Resource update 23 failed: (rc=-62) Timer expired
>> > Sep 10 10:27:03 node01 lrmd[1016]:  warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8693) timed out
>> > Sep 10 10:27:03 node01 lrmd[1016]:  warning: operation_finished: drbd_pg_monitor_31000:8693 - timed out after 20000ms
>> > Sep 10 10:27:54 node01 lrmd[1016]:  warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8938) timed out
>> > Sep 10 10:27:54 node01 lrmd[1016]:  warning: operation_finished: drbd_pg_monitor_31000:8938 - timed out after 20000ms
>> > Sep 10 10:28:33 node01 crmd[1019]:    error: crm_timer_popped: Integration Timer (I_INTEGRATED) just popped in state S_INTEGRATION! (180000ms)
>> > Sep 10 10:28:33 node01 crmd[1019]:  warning: do_state_transition: Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED
>> > Sep 10 10:28:33 node01 crmd[1019]:  warning: do_state_transition: 1 cluster nodes failed to respond to the join offer.
>> > Sep 10 10:28:33 node01 crmd[1019]:   notice: crmd_join_phase_log: join-1: node02=none
>> > Sep 10 10:28:33 node01 crmd[1019]:   notice: crmd_join_phase_log: join-1: node01=welcomed
>> > Sep 10 10:28:45 node01 lrmd[1016]:  warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9185) timed out
>> > Sep 10 10:28:45 node01 lrmd[1016]:  warning: operation_finished: drbd_pg_monitor_31000:9185 - timed out after 20000ms
>> > Sep 10 10:29:36 node01 lrmd[1016]:  warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9432) timed out
>> > Sep 10 10:29:36 node01 lrmd[1016]:  warning: operation_finished: drbd_pg_monitor_31000:9432 - timed out after 20000ms
>> > Sep 10 10:30:27 node01 lrmd[1016]:  warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9680) timed out
>> > Sep 10 10:30:27 node01 lrmd[1016]:  warning: operation_finished: drbd_pg_monitor_31000:9680 - timed out after 20000ms
>> > Sep 10 10:31:18 node01 lrmd[1016]:  warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9927) timed out
>> > Sep 10 10:31:18 node01 lrmd[1016]:  warning: operation_finished: drbd_pg_monitor_31000:9927 - timed out after 20000ms
>> > Sep 10 10:32:09 node01 lrmd[1016]:  warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 10174) timed out
>> > Sep 10 10:32:09 node01 lrmd[1016]:  warning: operation_finished: drbd_pg_monitor_31000:10174 - timed out after 20000ms
>> >
>> > # crm_mon on node01 before I kill the other vm:
>> > Stack: corosync
>> > Current DC: node02 (167938104) - partition with quorum
>> > Version: 1.1.10-42f2063
>> > 2 Nodes configured
>> > 5 Resources configured
>> >
>> > Online: [ node01 node02 ]
>> >
>> >  Resource Group: PGServer
>> >      fs_pg  (ocf::heartbeat:Filesystem):    Started node02
>> >      lsb_pg (lsb:postgresql):       Started node02
>> >      ip_pg  (ocf::heartbeat:IPaddr2):       Started node02
>> >  Master/Slave Set: ms_drbd_pg [drbd_pg]
>> >      Masters: [ node02 ]
>> >      Slaves: [ node01 ]
>> >
>> > Thank you,
>> > Kiam
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org