15.09.2014 04:24, Norbert Kiam Maclang wrote:
> Hi Vladislav and Andrew,
>
> After adding fencing/stonith (resource level) and fencing handlers on drbd, I am no longer getting monitor timeouts on drbd, but I am now experiencing a different problem. As I understand it, the logs on node01 show that it detected node02 as disconnected (and moved the resources to itself), but crm_mon still reports the resources as started on node02, which they are not.
That is probably the root of your issues. It _may_ be caused by the fact that the VMs are not scheduled onto the host CPUs fairly enough. You'd need either to rethink the whole architecture of your cluster or to somehow tune your cluster messaging layer to cope with that. Increasing 'totem.token' would be the first step (a minimal example stanza is sketched inline further below); see man corosync.conf and the cman docs.

> > Node01: > > node01 crmd[952]: notice: do_state_transition: State transition S_IDLE > -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED > origin=crm_timer_popped ] > node01 pengine[951]: notice: unpack_config: On loss of CCM Quorum: Ignore > node01 crmd[952]: notice: run_graph: Transition 260 (Complete=0, > Pending=0, Fired=0, Skipped=0, Incomplete=0, > Source=/var/lib/pacemaker/pengine/pe-input-78.bz2): Complete > node01 crmd[952]: notice: do_state_transition: State transition > S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL > origin=notify_crmd ] > node01 pengine[951]: notice: process_pe_message: Calculated Transition > 260: /var/lib/pacemaker/pengine/pe-input-78.bz2 > node01 corosync[917]: [TOTEM ] A processor failed, forming new > configuration. > node01 corosync[917]: [TOTEM ] A new membership (10.2.131.20:352) was formed. Members left: 167936789 > node01 crmd[952]: warning: match_down_event: No match for shutdown > action on 167936789 > node01 crmd[952]: notice: peer_update_callback: Stonith/shutdown of > node02 not matched > node01 crmd[952]: notice: do_state_transition: State transition S_IDLE > -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL > origin=abort_transition_graph ] > node01 pengine[951]: notice: unpack_config: On loss of CCM Quorum: Ignore > node01 pengine[951]: warning: pe_fence_node: Node node02 will be fenced > because our peer process is no longer available > node01 pengine[951]: warning: determine_online_status: Node node02 is > unclean > node01 pengine[951]: warning: stage6: Scheduling Node node02 for STONITH > node01 pengine[951]: notice: LogActions: Move fs_pg#011(Started > node02 -> node01) > node01 pengine[951]: notice: LogActions: Move ip_pg#011(Started > node02 -> node01) > node01 pengine[951]: notice: LogActions: Move lsb_pg#011(Started > node02 -> node01) > node01 pengine[951]: notice: LogActions: Demote drbd_pg:0#011(Master > -> Stopped node02) > node01 pengine[951]: notice: LogActions: Promote drbd_pg:1#011(Slave > -> Master node01) > node01 pengine[951]: notice: LogActions: Stop p_fence:0#011(node02) > node01 crmd[952]: notice: te_rsc_command: Initiating action 2: cancel > drbd_pg_cancel_31000 on node01 (local) > node01 crmd[952]: notice: te_fence_node: Executing reboot fencing > operation (54) on node02 (timeout=60000) > node01 stonith-ng[948]: notice: handle_request: Client > crmd.952.6d7ac808 wants to fence (reboot) 'node02' with device '(any)' > node01 stonith-ng[948]: notice: initiate_remote_stonith_op: Initiating > remote operation reboot for node02: 96530c7b-1c80-42c4-82cf-840bf3d5bb5f (0) > node01 crmd[952]: notice: te_rsc_command: Initiating action 68: notify > drbd_pg_pre_notify_demote_0 on node02 > node01 crmd[952]: notice: te_rsc_command: Initiating action 70: notify > drbd_pg_pre_notify_demote_0 on node01 (local) > node01 pengine[951]: warning: process_pe_message: Calculated Transition > 261: /var/lib/pacemaker/pengine/pe-warn-0.bz2 > node01 crmd[952]: notice: process_lrm_event: LRM operation > drbd_pg_notify_0 (call=63, rc=0, cib-update=0, confirmed=true) ok > node01 kernel: [230495.836024] d-con pg: PingAck did
not arrive in time. > node01 kernel: [230495.836176] d-con pg: peer( Primary -> Unknown ) > conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) > node01 kernel: [230495.837204] d-con pg: asender terminated > node01 kernel: [230495.837216] d-con pg: Terminating drbd_a_pg > node01 kernel: [230495.837286] d-con pg: Connection closed > node01 kernel: [230495.837298] d-con pg: conn( NetworkFailure -> > Unconnected ) > node01 kernel: [230495.837299] d-con pg: receiver terminated > node01 kernel: [230495.837300] d-con pg: Restarting receiver thread > node01 kernel: [230495.837304] d-con pg: receiver (re)started > node01 kernel: [230495.837314] d-con pg: conn( Unconnected -> > WFConnection ) > node01 crmd[952]: warning: action_timer_callback: Timer popped > (timeout=20000, abort_level=1000000, complete=false) > node01 crmd[952]: error: print_synapse: [Action 2]: Completed rsc > op drbd_pg_cancel_31000 on node01 (priority: 0, waiting: none) > node01 crmd[952]: warning: action_timer_callback: Timer popped > (timeout=20000, abort_level=1000000, complete=false) > node01 crmd[952]: error: print_synapse: [Action 68]: In-flight rsc > op drbd_pg_pre_notify_demote_0 on node02 (priority: 0, waiting: none) > node01 crmd[952]: warning: cib_action_update: rsc_op 68: > drbd_pg_pre_notify_demote_0 on node02 timed out > node01 crmd[952]: error: cib_action_updated: Update 297 FAILED: Timer > expired > node01 crmd[952]: error: stonith_async_timeout_handler: Async call 2 > timed out after 120000ms > node01 crmd[952]: notice: tengine_stonith_callback: Stonith operation > 2/54:261:0:6978227d-ce2d-4dc6-955a-eb9313f112a5: Timer expired (-62) > node01 crmd[952]: notice: tengine_stonith_callback: Stonith operation > 2 for node02 failed (Timer expired): aborting transition. 
> node01 crmd[952]: notice: run_graph: Transition 261 (Complete=6, > Pending=0, Fired=0, Skipped=29, Incomplete=15, > Source=/var/lib/pacemaker/pengine/pe-warn-0.bz2): Stopped > node01 pengine[951]: notice: unpack_config: On loss of CCM Quorum: Ignore > node01 pengine[951]: warning: pe_fence_node: Node node02 will be fenced > because our peer process is no longer available > node01 pengine[951]: warning: determine_online_status: Node node02 is > unclean > node01 pengine[951]: warning: stage6: Scheduling Node node02 for STONITH > node01 pengine[951]: notice: LogActions: Move fs_pg#011(Started > node02 -> node01) > node01 pengine[951]: notice: LogActions: Move ip_pg#011(Started > node02 -> node01) > node01 pengine[951]: notice: LogActions: Move lsb_pg#011(Started > node02 -> node01) > node01 pengine[951]: notice: LogActions: Demote drbd_pg:0#011(Master > -> Stopped node02) > node01 pengine[951]: notice: LogActions: Promote drbd_pg:1#011(Slave > -> Master node01) > node01 pengine[951]: notice: LogActions: Stop p_fence:0#011(node02) > node01 crmd[952]: notice: te_fence_node: Executing reboot fencing > operation (53) on node02 (timeout=60000) > node01 stonith-ng[948]: notice: handle_request: Client > crmd.952.6d7ac808 wants to fence (reboot) 'node02' with device '(any)' > node01 stonith-ng[948]: notice: initiate_remote_stonith_op: Initiating > remote operation reboot for node02: a4fae8ce-3a6c-4fe5-a934-b5b83ae123cb (0) > node01 crmd[952]: notice: te_rsc_command: Initiating action 67: notify > drbd_pg_pre_notify_demote_0 on node02 > node01 crmd[952]: notice: te_rsc_command: Initiating action 69: notify > drbd_pg_pre_notify_demote_0 on node01 (local) > node01 pengine[951]: warning: process_pe_message: Calculated Transition > 262: /var/lib/pacemaker/pengine/pe-warn-1.bz2 > node01 crmd[952]: notice: process_lrm_event: LRM operation > drbd_pg_notify_0 (call=66, rc=0, cib-update=0, confirmed=true) ok > > Last updated: Mon Sep 15 01:15:59 2014 > Last change: Sat Sep 13 15:23:45 2014 via cibadmin on node01 > Stack: corosync > Current DC: node01 (167936788) - partition with quorum > Version: 1.1.10-42f2063 > 2 Nodes configured > 7 Resources configured > > > Node node02 (167936789): UNCLEAN (online) > Online: [ node01 ] > > Resource Group: PGServer > fs_pg (ocf::heartbeat:Filesystem): Started node02 > ip_pg (ocf::heartbeat:IPaddr2): Started node02 > lsb_pg (lsb:postgresql): Started node02 > Master/Slave Set: ms_drbd_pg [drbd_pg] > Masters: [ node02 ] > Slaves: [ node01 ] > Clone Set: cln_p_fence [p_fence] > Started: [ node01 node02 ] > > Thank you, > Norbert > > On Fri, Sep 12, 2014 at 12:06 PM, Vladislav Bogdanov > <bub...@hoster-ok.com <mailto:bub...@hoster-ok.com>> wrote: > > 12.09.2014 05:00, Norbert Kiam Maclang wrote: > > Hi, > > > > After adding resource level fencing on drbd, I still ended up having > > problems with timeouts on drbd. Is there a recommended settings for > > this? I followed what is written in the drbd documentation - > > > > http://www.drbd.org/users-guide-emb/s-pacemaker-crm-drbd-backed-service.html > > , Another thing I can't understand is why during initial tests, even I > > reboot the vms several times, failover works. But after I soak it > for a > > couple of hours (say for example 8 hours or more) and continue > with the > > tests, it will not failover and experience split brain. I confirmed it > > though that everything is healthy before performing a reboot. Disk > > health and network is good, drbd is synced, time beetween servers > is good. 
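To make the totem.token suggestion concrete, here is a minimal sketch of the relevant totem settings, assuming corosync 2.x with the udpu transport from the configuration quoted further down. The values are only illustrative and should be sized to the longest scheduling stall you expect the VMs to suffer:

    totem {
            version: 2
            transport: udpu
            # Allow up to 10 s without a token before corosync declares a
            # membership change (the quoted configuration uses 3000 ms).
            token: 10000
            token_retransmits_before_loss_const: 10
            # consensus must stay larger than token; either raise it to at
            # least 1.2 * token or drop the line and let corosync compute
            # the default.
            consensus: 12000
            # Keep the remaining settings (interface, bindnetaddr, port,
            # secauth, ...) as in the existing configuration.
    }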
> I recall seeing something similar a year ago (around the time your pacemaker version is dated). I do not remember the exact cause, but I saw the drbd RA time out because it was waiting for something (fencing) in kernel space to finish. drbd calls userspace scripts from within kernel space, and you will see them in the process list with the drbd kernel thread as their parent.
>
> I'd also upgrade your corosync configuration from "member" to "nodelist" syntax, specifying the "name" parameter together with ring0_addr for each node (that parameter is not referenced in the corosync docs but should be somewhere in Pacemaker Explained - it is used only by pacemaker). A sketch of that syntax, together with two_node, is included near the bottom of this mail.
>
> There is also trace_ra support in both pacemaker and crmsh (I cannot say whether the versions you have support it, though probably yes), so you may want to play with that to get the exact picture from the resource agent; an example invocation is sketched there as well.
>
> In any case, upgrading to 1.1.12 and a more recent crmsh would be worthwhile, because you may simply be hitting a long-ago solved and forgotten bug.
>
> Concerning your
>
> expected-quorum-votes="1"
>
> You need to configure votequorum in corosync with two_node: 1 instead of that line.
>
> > # Logs: > > node01 lrmd[1036]: warning: child_timeout_callback: > > drbd_pg_monitor_29000 process (PID 27744) timed out > > node01 lrmd[1036]: warning: operation_finished: > > drbd_pg_monitor_29000:27744 - timed out after 20000ms > > node01 crmd[1039]: error: process_lrm_event: LRM operation > > drbd_pg_monitor_29000 (69) Timed Out (timeout=20000ms) > > node01 crmd[1039]: warning: update_failcount: Updating failcount for > > drbd_pg on tyo1mqdb01p after failed monitor: rc=1 (update=value++, > > time=1410486352) > > > > Thanks, > > Kiam > > > > On Thu, Sep 11, 2014 at 6:58 PM, Norbert Kiam Maclang > > <norbert.kiam.macl...@gmail.com> wrote: > > > > Thank you Vladislav. > > > > I have configured resource level fencing on drbd and removed > > wfc-timeout and degr-wfc-timeout (is this required?). My drbd > > configuration is now: > > > > resource pg { > > device /dev/drbd0; > > disk /dev/vdb; > > meta-disk internal; > > disk { > > fencing resource-only; > > on-io-error detach; > > resync-rate 40M; > > } > > handlers { > > fence-peer "/usr/lib/drbd/crm-fence-peer.sh"; > > after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh"; > > split-brain "/usr/lib/drbd/notify-split-brain.sh nkbm"; > > } > > on node01 { > > address 10.2.136.52:7789; > > } > > on node02 { > > address 10.2.136.55:7789; > > } > > net { > > verify-alg md5; > > after-sb-0pri discard-zero-changes; > > after-sb-1pri discard-secondary; > > after-sb-2pri disconnect; > > } > > } > > > > Failover works on my initial test (restarting both nodes alternately > > - this always works). I will wait for a couple of hours and then do a > > failover test again (which always failed on my previous setup). > > > > Thank you! > > Kiam > > > > On Thu, Sep 11, 2014 at 2:14 PM, Vladislav Bogdanov > > <bub...@hoster-ok.com> wrote: > > > > 11.09.2014 05:57, Norbert Kiam Maclang wrote: > > > Is this something to do with quorum?
But I already set > > > > You'd need to configure fencing at the drbd resources level. > > > > > > http://www.drbd.org/users-guide-emb/s-pacemaker-fencing.html#s-pacemaker-fencing-cib > > > > > > > > > > property no-quorum-policy="ignore" \ > > > expected-quorum-votes="1" > > > > > > Thanks in advance, > > > Kiam > > > > > > On Thu, Sep 11, 2014 at 10:09 AM, Norbert Kiam Maclang > > > <norbert.kiam.macl...@gmail.com > <mailto:norbert.kiam.macl...@gmail.com> > > <mailto:norbert.kiam.macl...@gmail.com > <mailto:norbert.kiam.macl...@gmail.com>> > > <mailto:norbert.kiam.macl...@gmail.com > <mailto:norbert.kiam.macl...@gmail.com> > > <mailto:norbert.kiam.macl...@gmail.com > <mailto:norbert.kiam.macl...@gmail.com>>>> > > > wrote: > > > > > > Hi, > > > > > > Please help me understand what is causing the problem. I > > have a 2 > > > node cluster running on vms using KVM. Each vm (I am > using > > Ubuntu > > > 14.04) runs on a separate hypervisor on separate > machines. > > All are > > > working well during testing (I restarted the vms > > alternately), but > > > after a day when I kill the other node, I always end up > > corosync and > > > pacemaker hangs on the surviving node. Date and time on > > the vms are > > > in sync, I use unicast, tcpdump shows both nodes > exchanges, > > > confirmed that DRBD is healthy and crm_mon show good > > status before I > > > kill the other node. Below are my configurations and > > versions I used: > > > > > > corosync 2.3.3-1ubuntu1 > > > crmsh 1.2.5+hg1034-1ubuntu3 > > > drbd8-utils 2:8.4.4-1ubuntu1 > > > libcorosync-common4 2.3.3-1ubuntu1 > > > libcrmcluster4 1.1.10+git20130802-1ubuntu2 > > > libcrmcommon3 1.1.10+git20130802-1ubuntu2 > > > libcrmservice1 1.1.10+git20130802-1ubuntu2 > > > pacemaker 1.1.10+git20130802-1ubuntu2 > > > pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2 > > > postgresql-9.3 9.3.5-0ubuntu0.14.04.1 > > > > > > # /etc/corosync/corosync: > > > totem { > > > version: 2 > > > token: 3000 > > > token_retransmits_before_loss_const: 10 > > > join: 60 > > > consensus: 3600 > > > vsftype: none > > > max_messages: 20 > > > clear_node_high_bit: yes > > > secauth: off > > > threads: 0 > > > rrp_mode: none > > > interface { > > > member { > > > memberaddr: 10.2.136.56 > > > } > > > member { > > > memberaddr: 10.2.136.57 > > > } > > > ringnumber: 0 > > > bindnetaddr: 10.2.136.0 > > > mcastport: 5405 > > > } > > > transport: udpu > > > } > > > amf { > > > mode: disabled > > > } > > > quorum { > > > provider: corosync_votequorum > > > expected_votes: 1 > > > } > > > aisexec { > > > user: root > > > group: root > > > } > > > logging { > > > fileline: off > > > to_stderr: yes > > > to_logfile: no > > > to_syslog: yes > > > syslog_facility: daemon > > > debug: off > > > timestamp: on > > > logger_subsys { > > > subsys: AMF > > > debug: off > > > tags: > > enter|leave|trace1|trace2|trace3|trace4|trace6 > > > } > > > } > > > > > > # /etc/corosync/service.d/pcmk: > > > service { > > > name: pacemaker > > > ver: 1 > > > } > > > > > > /etc/drbd.d/global_common.conf: > > > global { > > > usage-count no; > > > } > > > > > > common { > > > net { > > > protocol C; > > > } > > > } > > > > > > # /etc/drbd.d/pg.res: > > > resource pg { > > > device /dev/drbd0; > > > disk /dev/vdb; > > > meta-disk internal; > > > startup { > > > wfc-timeout 15; > > > degr-wfc-timeout 60; > > > } > > > disk { > > > on-io-error detach; > > > resync-rate 40M; > > > } > > > on node01 { > > > address 10.2.136.56:7789 > <http://10.2.136.56:7789> <http://10.2.136.56:7789> > > 
<http://10.2.136.56:7789>; > > > } > > > on node02 { > > > address 10.2.136.57:7789 > <http://10.2.136.57:7789> <http://10.2.136.57:7789> > > <http://10.2.136.57:7789>; > > > } > > > net { > > > verify-alg md5; > > > after-sb-0pri discard-zero-changes; > > > after-sb-1pri discard-secondary; > > > after-sb-2pri disconnect; > > > } > > > } > > > > > > # Pacemaker configuration: > > > node $id="167938104" node01 > > > node $id="167938105" node02 > > > primitive drbd_pg ocf:linbit:drbd \ > > > params drbd_resource="pg" \ > > > op monitor interval="29s" role="Master" \ > > > op monitor interval="31s" role="Slave" > > > primitive fs_pg ocf:heartbeat:Filesystem \ > > > params device="/dev/drbd0" > > directory="/var/lib/postgresql/9.3/main" > > > fstype="ext4" > > > primitive ip_pg ocf:heartbeat:IPaddr2 \ > > > params ip="10.2.136.59" cidr_netmask="24" nic="eth0" > > > primitive lsb_pg lsb:postgresql > > > group PGServer fs_pg lsb_pg ip_pg > > > ms ms_drbd_pg drbd_pg \ > > > meta master-max="1" master-node-max="1" clone-max="2" > > > clone-node-max="1" notify="true" > > > colocation pg_on_drbd inf: PGServer ms_drbd_pg:Master > > > order pg_after_drbd inf: ms_drbd_pg:promote > PGServer:start > > > property $id="cib-bootstrap-options" \ > > > dc-version="1.1.10-42f2063" \ > > > cluster-infrastructure="corosync" \ > > > stonith-enabled="false" \ > > > no-quorum-policy="ignore" > > > rsc_defaults $id="rsc-options" \ > > > resource-stickiness="100" > > > > > > # Logs on node01 > > > Sep 10 10:25:33 node01 crmd[1019]: notice: > > peer_update_callback: > > > Our peer on the DC is dead > > > Sep 10 10:25:33 node01 crmd[1019]: notice: > > do_state_transition: > > > State transition S_NOT_DC -> S_ELECTION [ > input=I_ELECTION > > > cause=C_CRMD_STATUS_CALLBACK > origin=peer_update_callback ] > > > Sep 10 10:25:33 node01 crmd[1019]: notice: > > do_state_transition: > > > State transition S_ELECTION -> S_INTEGRATION [ > > input=I_ELECTION_DC > > > cause=C_FSA_INTERNAL origin=do_election_check ] > > > Sep 10 10:25:33 node01 corosync[940]: [TOTEM ] A new > > membership > > > (10.2.136.56:52 <http://10.2.136.56:52> > <http://10.2.136.56:52> > > <http://10.2.136.56:52>) was formed. Members left: > > > 167938105 > > > Sep 10 10:25:45 node01 kernel: [74452.740024] d-con pg: > > PingAck did > > > not arrive in time. 
> > > Sep 10 10:25:45 node01 kernel: [74452.740169] d-con > pg: peer( > > > Primary -> Unknown ) conn( Connected -> > NetworkFailure ) pdsk( > > > UpToDate -> DUnknown ) > > > Sep 10 10:25:45 node01 kernel: [74452.740987] d-con pg: > > asender > > > terminated > > > Sep 10 10:25:45 node01 kernel: [74452.740999] d-con pg: > > Terminating > > > drbd_a_pg > > > Sep 10 10:25:45 node01 kernel: [74452.741235] d-con pg: > > Connection > > > closed > > > Sep 10 10:25:45 node01 kernel: [74452.741259] d-con > pg: conn( > > > NetworkFailure -> Unconnected ) > > > Sep 10 10:25:45 node01 kernel: [74452.741260] d-con pg: > > receiver > > > terminated > > > Sep 10 10:25:45 node01 kernel: [74452.741261] d-con pg: > > Restarting > > > receiver thread > > > Sep 10 10:25:45 node01 kernel: [74452.741262] d-con pg: > > receiver > > > (re)started > > > Sep 10 10:25:45 node01 kernel: [74452.741269] d-con > pg: conn( > > > Unconnected -> WFConnection ) > > > Sep 10 10:26:12 node01 lrmd[1016]: warning: > > child_timeout_callback: > > > drbd_pg_monitor_31000 process (PID 8445) timed out > > > Sep 10 10:26:12 node01 lrmd[1016]: warning: > > operation_finished: > > > drbd_pg_monitor_31000:8445 - timed out after 20000ms > > > Sep 10 10:26:12 node01 crmd[1019]: error: > > process_lrm_event: LRM > > > operation drbd_pg_monitor_31000 (30) Timed Out > > (timeout=20000ms) > > > Sep 10 10:26:32 node01 crmd[1019]: warning: > cib_rsc_callback: > > > Resource update 23 failed: (rc=-62) Timer expired > > > Sep 10 10:27:03 node01 lrmd[1016]: warning: > > child_timeout_callback: > > > drbd_pg_monitor_31000 process (PID 8693) timed out > > > Sep 10 10:27:03 node01 lrmd[1016]: warning: > > operation_finished: > > > drbd_pg_monitor_31000:8693 - timed out after 20000ms > > > Sep 10 10:27:54 node01 lrmd[1016]: warning: > > child_timeout_callback: > > > drbd_pg_monitor_31000 process (PID 8938) timed out > > > Sep 10 10:27:54 node01 lrmd[1016]: warning: > > operation_finished: > > > drbd_pg_monitor_31000:8938 - timed out after 20000ms > > > Sep 10 10:28:33 node01 crmd[1019]: error: > crm_timer_popped: > > > Integration Timer (I_INTEGRATED) just popped in state > > S_INTEGRATION! > > > (180000ms) > > > Sep 10 10:28:33 node01 crmd[1019]: warning: > > do_state_transition: > > > Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED > > > Sep 10 10:28:33 node01 crmd[1019]: warning: > > do_state_transition: 1 > > > cluster nodes failed to respond to the join offer. 
> > > Sep 10 10:28:33 node01 crmd[1019]: notice: > > crmd_join_phase_log: > > > join-1: node02=none > > > Sep 10 10:28:33 node01 crmd[1019]: notice: > > crmd_join_phase_log: > > > join-1: node01=welcomed > > > Sep 10 10:28:45 node01 lrmd[1016]: warning: > > child_timeout_callback: > > > drbd_pg_monitor_31000 process (PID 9185) timed out > > > Sep 10 10:28:45 node01 lrmd[1016]: warning: > > operation_finished: > > > drbd_pg_monitor_31000:9185 - timed out after 20000ms > > > Sep 10 10:29:36 node01 lrmd[1016]: warning: > > child_timeout_callback: > > > drbd_pg_monitor_31000 process (PID 9432) timed out > > > Sep 10 10:29:36 node01 lrmd[1016]: warning: > > operation_finished: > > > drbd_pg_monitor_31000:9432 - timed out after 20000ms > > > Sep 10 10:30:27 node01 lrmd[1016]: warning: > > child_timeout_callback: > > > drbd_pg_monitor_31000 process (PID 9680) timed out > > > Sep 10 10:30:27 node01 lrmd[1016]: warning: > > operation_finished: > > > drbd_pg_monitor_31000:9680 - timed out after 20000ms > > > Sep 10 10:31:18 node01 lrmd[1016]: warning: > > child_timeout_callback: > > > drbd_pg_monitor_31000 process (PID 9927) timed out > > > Sep 10 10:31:18 node01 lrmd[1016]: warning: > > operation_finished: > > > drbd_pg_monitor_31000:9927 - timed out after 20000ms > > > Sep 10 10:32:09 node01 lrmd[1016]: warning: > > child_timeout_callback: > > > drbd_pg_monitor_31000 process (PID 10174) timed out > > > Sep 10 10:32:09 node01 lrmd[1016]: warning: > > operation_finished: > > > drbd_pg_monitor_31000:10174 - timed out after 20000ms > > > > > > #crm_mon on node01 before I kill the other vm: > > > Stack: corosync > > > Current DC: node02 (167938104) - partition with quorum > > > Version: 1.1.10-42f2063 > > > 2 Nodes configured > > > 5 Resources configured > > > > > > Online: [ node01 node02 ] > > > > > > Resource Group: PGServer > > > fs_pg (ocf::heartbeat:Filesystem): > Started node02 > > > lsb_pg (lsb:postgresql): Started node02 > > > ip_pg (ocf::heartbeat:IPaddr2): > Started node02 > > > Master/Slave Set: ms_drbd_pg [drbd_pg] > > > Masters: [ node02 ] > > > Slaves: [ node01 ] > > > > > > Thank you, > > > Kiam > > > > > > > > > > > > > > > _______________________________________________ > > > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > <mailto:Pacemaker@oss.clusterlabs.org> > > <mailto:Pacemaker@oss.clusterlabs.org > <mailto:Pacemaker@oss.clusterlabs.org>> > > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > > > > > Project Home: http://www.clusterlabs.org > > > Getting started: > > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > > > Bugs: http://bugs.clusterlabs.org > > > > > > > > > _______________________________________________ > > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > <mailto:Pacemaker@oss.clusterlabs.org> > > <mailto:Pacemaker@oss.clusterlabs.org > <mailto:Pacemaker@oss.clusterlabs.org>> > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > > > Project Home: http://www.clusterlabs.org > > Getting started: > > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > > Bugs: http://bugs.clusterlabs.org > > > > > > > > > > > > _______________________________________________ > > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > <mailto:Pacemaker@oss.clusterlabs.org> > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > > > Project Home: http://www.clusterlabs.org > > Getting started: > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > > Bugs: http://bugs.clusterlabs.org > > > > > 
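To make the nodelist/two_node suggestion above concrete, here is a hedged sketch of how the corosync.conf quoted in this thread could be converted, reusing the node names and addresses from that configuration (the nodeids match the ones pacemaker reports; adjust as needed). The nodelist block replaces the member entries inside the interface section, and two_node: 1 replaces expected_votes: 1:

    nodelist {
            node {
                    ring0_addr: 10.2.136.56
                    name: node01
                    nodeid: 167938104
            }
            node {
                    ring0_addr: 10.2.136.57
                    name: node02
                    nodeid: 167938105
            }
    }

    quorum {
            provider: corosync_votequorum
            # Special two-node mode: the surviving node keeps quorum when
            # its peer dies.
            two_node: 1
    }

As for trace_ra, I cannot promise that the crmsh 1.2.5 shipped with Ubuntu 14.04 already has the 'resource trace' command, so treat the following as a sketch. Setting the operation attribute by hand should work with any resource agent built on ocf-shellfuncs; the traces typically land under /var/lib/heartbeat/trace_ra/:

    # newer crmsh
    crm resource trace drbd_pg monitor

    # or add trace_ra=1 to the monitor op of the existing primitive
    primitive drbd_pg ocf:linbit:drbd \
            params drbd_resource="pg" \
            op monitor interval="29s" role="Master" trace_ra=1 \
            op monitor interval="31s" role="Slave"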
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org