Hello,
I have a three-node Pacemaker+Heartbeat cluster (two real nodes and one "standby" quorum node) running Ubuntu 10.04 LTS, using the Pacemaker and Heartbeat packages from the Ubuntu HA Team PPA (https://launchpad.net/~ubuntu-ha-maintainers/+archive/ppa). I have configured three DRBD resources, a filesystem mount, and a KVM-based virtual machine (managed by the VirtualDomain resource agent). Constraints are in place so that the DRBD devices must be promoted to primary and the filesystem must be mounted before the VM can start:

node $id="1ab0690c-5aa0-4d9c-ae4e-b662e0ca54e5" vmhost1
node $id="219e9bf6-ea99-41f4-895f-4c2c5c78484a" quorumnode \
    attributes standby="on"
node $id="645e09b4-aee5-4cec-a241-8bd4e03a78c3" vmhost2
primitive p_drbd_mount2 ocf:linbit:drbd \
    params drbd_resource="mount2" \
    op start interval="0" timeout="240" \
    op stop interval="0" timeout="100" \
    op monitor interval="10" role="Master" timeout="30" \
    op monitor interval="20" role="Slave" timeout="30"
primitive p_drbd_mount1 ocf:linbit:drbd \
    params drbd_resource="mount1" \
    op start interval="0" timeout="240" \
    op stop interval="0" timeout="100" \
    op monitor interval="10" role="Master" timeout="30" \
    op monitor interval="20" role="Slave" timeout="30"
primitive p_drbd_vmstore ocf:linbit:drbd \
    params drbd_resource="vmstore" \
    op start interval="0" timeout="240" \
    op stop interval="0" timeout="100" \
    op monitor interval="10" role="Master" timeout="30" \
    op monitor interval="20" role="Slave" timeout="30"
primitive p_fs_vmstore ocf:heartbeat:Filesystem \
    params device="/dev/drbd0" directory="/mnt/storage/vmstore" fstype="ext4" \
    op start interval="0" timeout="60" \
    op stop interval="0" timeout="60" \
    op monitor interval="20" timeout="40"
primitive p_ping ocf:pacemaker:ping \
    params name="p_ping" host_list="192.168.1.25 192.168.1.26" multiplier="1000" \
    op start interval="0" timeout="60" \
    op monitor interval="20s" timeout="60"
primitive p_sysadmin_notify ocf:heartbeat:MailTo \
    params email="al...@example.com" \
    params subject="Pacemaker Change" \
    op start interval="0" timeout="10" \
    op stop interval="0" timeout="10" \
    op monitor interval="10" timeout="10"
primitive p_vm_myvm ocf:heartbeat:VirtualDomain \
    params config="/mnt/storage/vmstore/config/myvm.xml" \
    meta allow-migrate="false" target-role="Started" is-managed="true" \
    op start interval="0" timeout="180" \
    op stop interval="0" timeout="180" \
    op monitor interval="10" timeout="30"
primitive stonithquorumnode stonith:external/webpowerswitch \
    params wps_ipaddr="192.168.3.100" wps_port="x" wps_username="xxx" wps_password="xxx" hostname_to_stonith="quorumnode"
primitive stonithvmhost1 stonith:external/webpowerswitch \
    params wps_ipaddr="192.168.3.100" wps_port="x" wps_username="xxx" wps_password="xxx" hostname_to_stonith="vmhost1"
primitive stonithvmhost2 stonith:external/webpowerswitch \
    params wps_ipaddr="192.168.3.100" wps_port="x" wps_username="xxx" wps_password="xxx" hostname_to_stonith="vmhost2"
group g_vm p_fs_vmstore p_vm_myvm
ms ms_drbd_mount2 p_drbd_mount2 \
    meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
ms ms_drbd_mount1 p_drbd_mount1 \
    meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
ms ms_drbd_vmstore p_drbd_vmstore \
    meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
clone cl_ping p_ping \
    meta interleave="true"
clone cl_sysadmin_notify p_sysadmin_notify
location loc_run_on_most_connected g_vm \
    rule $id="loc_run_on_most_connected-rule" p_ping: defined p_ping
location loc_st_nodescan stonithquorumnode -inf: vmhost1
location loc_st_vmhost1 stonithvmhost1 -inf: vmhost1
location loc_st_vmhost2 stonithvmhost2 -inf: vmhost2
colocation c_drbd_libvirt_vm inf: g_vm ms_drbd_vmstore:Master ms_drbd_mount1:Master ms_drbd_mount2:Master
order o_drbd-fs-vm inf: ms_drbd_vmstore:promote ms_drbd_mount1:promote ms_drbd_mount2:promote g_vm:start
property $id="cib-bootstrap-options" \
    dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
    cluster-infrastructure="Heartbeat" \
    stonith-enabled="true" \
    no-quorum-policy="freeze" \
    last-lrm-refresh="1337746179"

This has been working well; however, last week Pacemaker suddenly stopped the p_vm_myvm resource and then started it up again. I have included the relevant section of /var/log/daemon.log below, but I am unable to determine from it what caused Pacemaker to restart this resource. Based on the log, could you tell me what event triggered the restart?
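In case it helps with reading the configuration: the placement scores behind loc_run_on_most_connected can be dumped from the live CIB. This is just a sketch of how I understand the tooling - it assumes the crm_simulate shipped with these 1.1.6 packages supports the -L/--live-check and -s/--show-scores options (the older ptest should accept the same flags):

# dump allocation scores from the live cluster and pick out the VM group
crm_simulate -L -s | grep -E 'g_vm|p_vm_myvm'

The same options against a saved CIB file (crm_simulate -x <file> -s) should work for offline inspection.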
Jun 14 15:25:00 vmhost1 lrmd: [3853]: info: rsc:p_sysadmin_notify:0 monitor[18] (pid 3661)
Jun 14 15:25:00 vmhost1 lrmd: [3853]: info: operation monitor[18] on p_sysadmin_notify:0 for client 3856: pid 3661 exited with return code 0
Jun 14 15:26:42 vmhost1 cib: [3852]: info: cib_stats: Processed 219 operations (182.00us average, 0% utilization) in the last 10min
Jun 14 15:32:43 vmhost1 lrmd: [3853]: info: operation monitor[22] on p_ping:0 for client 3856: pid 10059 exited with return code 0
Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: rsc:p_drbd_vmstore:0 monitor[55] (pid 12323)
Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: rsc:p_drbd_mount2:0 monitor[53] (pid 12324)
Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: operation monitor[55] on p_drbd_vmstore:0 for client 3856: pid 12323 exited with return code 8
Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: operation monitor[53] on p_drbd_mount2:0 for client 3856: pid 12324 exited with return code 8
Jun 14 15:35:31 vmhost1 lrmd: [3853]: info: rsc:p_drbd_mount1:0 monitor[54] (pid 12396)
Jun 14 15:35:31 vmhost1 lrmd: [3853]: info: operation monitor[54] on p_drbd_mount1:0 for client 3856: pid 12396 exited with return code 8
Jun 14 15:36:42 vmhost1 cib: [3852]: info: cib_stats: Processed 220 operations (272.00us average, 0% utilization) in the last 10min
Jun 14 15:37:34 vmhost1 lrmd: [3853]: info: rsc:p_vm_myvm monitor[57] (pid 14061)
Jun 14 15:37:34 vmhost1 lrmd: [3853]: info: operation monitor[57] on p_vm_myvm for client 3856: pid 14061 exited with return code 0
Jun 14 15:42:35 vmhost1 attrd: [3855]: notice: attrd_trigger_update: Sending flush op to all hosts for: p_ping (1000)
Jun 14 15:42:35 vmhost1 attrd: [3855]: notice: attrd_perform_update: Sent update 163: p_ping=1000
Jun 14 15:42:36 vmhost1 lrmd: [3853]: info: cancel_op: operation monitor[53] on p_drbd_mount2:0 for client 3856, its parameters: CRM_meta_clone=[0] CRM_meta_role=[Master] CRM_meta_notify_slave_resource=[ ] CRM_meta_notify_active_resource=[ ] CRM_meta_timeout=[30000] CRM_meta_notify_demote_uname=[ ] drbd_resource=[crm] CRM_meta_master_node_max=[1] CRM_meta_notify_stop_resource=[ ] CRM_meta_notify_master_resource=[ ] CRM_meta_clone_node_max=[1] CRM_meta_notify_demote_resource=[ ] CRM_meta_clone_max=[2] CRM_meta_notify_start_resource=[p_drbd_mount2:0 ] CRM_meta_notify_ cancelled
Jun 14 15:42:36 vmhost1 crmd: [3856]: info: send_direct_ack: ACK'ing resource op p_drbd_mount2:0_monitor_10000 from 1:2351:0:7f6d66f7-cfe5-4820-8289-0e47d8c9102b: lrm_invoke-lrmd-1339706556-53
Jun 14 15:42:36 vmhost1 lrmd: [3853]: info: cancel_op: operation monitor[54] on p_drbd_mount1:0 for client 3856, its parameters: CRM_meta_clone=[0] CRM_meta_role=[Master] CRM_meta_notify_slave_resource=[ ] CRM_meta_notify_active_resource=[ ] CRM_meta_timeout=[30000] CRM_meta_notify_demote_uname=[ ] drbd_resource=[tools] CRM_meta_master_node_max=[1] CRM_meta_notify_stop_resource=[ ] CRM_meta_notify_master_resource=[ ] CRM_meta_clone_node_max=[1] CRM_meta_notify_demote_resource=[ ] CRM_meta_clone_max=[2] CRM_meta_notify_start_resource=[p_drbd_mount1:0 ] CRM_meta_n cancelled
Jun 14 15:42:36 vmhost1 crmd: [3856]: info: send_direct_ack: ACK'ing resource op p_drbd_mount1:0_monitor_10000 from 6:2351:0:7f6d66f7-cfe5-4820-8289-0e47d8c9102b: lrm_invoke-lrmd-1339706556-54
Jun 14 15:42:36 vmhost1 lrmd: [3853]: info: cancel_op: operation monitor[55] on p_drbd_vmstore:0 for client 3856, its parameters: CRM_meta_clone=[0] CRM_meta_role=[Master] CRM_meta_notify_slave_resource=[ ] CRM_meta_notify_active_resource=[ ] CRM_meta_timeout=[30000] CRM_meta_notify_demote_uname=[ ] drbd_resource=[vmstore] CRM_meta_master_node_max=[1] CRM_meta_notify_stop_resource=[ ] CRM_meta_notify_master_resource=[ ] CRM_meta_clone_node_max=[1] CRM_meta_notify_demote_resource=[ ] CRM_meta_clone_max=[2] CRM_meta_notify_start_resource=[p_drbd_vmstore:0 ] CRM_ cancelled
Jun 14 15:42:36 vmhost1 crmd: [3856]: info: send_direct_ack: ACK'ing resource op p_drbd_vmstore:0_monitor_10000 from 2:2351:0:7f6d66f7-cfe5-4820-8289-0e47d8c9102b: lrm_invoke-lrmd-1339706556-55
Jun 14 15:42:36 vmhost1 lrmd: [3853]: info: cancel_op: operation monitor[57] on p_vm_myvm for client 3856, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.5] config=[/mnt/storage/vmstore/config/myvm.xml] CRM_meta_timeout=[30000] CRM_meta_interval=[10000] cancelled
Jun 14 15:42:36 vmhost1 crmd: [3856]: info: do_lrm_rsc_op: Performing key=136:2351:0:7f6d66f7-cfe5-4820-8289-0e47d8c9102b op=p_vm_myvm_stop_0 )
Jun 14 15:42:36 vmhost1 lrmd: [3853]: info: rsc:p_vm_myvm stop[58] (pid 18174)
Jun 14 15:42:36 vmhost1 crmd: [3856]: info: do_lrm_rsc_op: Performing key=168:2351:0:7f6d66f7-cfe5-4820-8289-0e47d8c9102b op=p_drbd_mount2:0_notify_0 )
Jun 14 15:42:36 vmhost1 lrmd: [3853]: info: rsc:p_drbd_mount2:0 notify[59] (pid 18175)
Jun 14 15:42:36 vmhost1 crmd: [3856]: info: do_lrm_rsc_op: Performing key=184:2351:0:7f6d66f7-cfe5-4820-8289-0e47d8c9102b op=p_drbd_mount1:0_notify_0 )
Jun 14 15:42:36 vmhost1 lrmd: [3853]: info: rsc:p_drbd_mount1:0 notify[60] (pid 18177)
Jun 14 15:42:36 vmhost1 crmd: [3856]: info: do_lrm_rsc_op: Performing key=200:2351:0:7f6d66f7-cfe5-4820-8289-0e47d8c9102b op=p_drbd_vmstore:0_notify_0 )
Jun 14 15:42:36 vmhost1 lrmd: [3853]: info: rsc:p_drbd_vmstore:0 notify[61] (pid 18179)
Jun 14 15:42:36 vmhost1 crmd: [3856]: info: process_lrm_event: LRM operation p_drbd_mount2:0_monitor_10000 (call=53, status=1, cib-update=0, confirmed=true) Cancelled
Jun 14 15:42:36 vmhost1 crmd: [3856]: info: process_lrm_event: LRM operation p_drbd_mount1:0_monitor_10000 (call=54, status=1, cib-update=0, confirmed=true) Cancelled
Jun 14 15:42:36 vmhost1 crmd: [3856]: info: process_lrm_event: LRM operation p_drbd_vmstore:0_monitor_10000 (call=55, status=1, cib-update=0, confirmed=true) Cancelled
Jun 14 15:42:36 vmhost1 crmd: [3856]: info: process_lrm_event: LRM operation p_vm_myvm_monitor_10000 (call=57, status=1, cib-update=0, confirmed=true) Cancelled
Jun 14 15:42:36 vmhost1 lrmd: [3853]: info: operation notify[59] on p_drbd_mount2:0 for client 3856: pid 18175 exited with return code 0
Jun 14 15:42:36 vmhost1 lrmd: [3853]: info: operation notify[60] on p_drbd_mount1:0 for client 3856: pid 18177 exited with return code 0
Jun 14 15:42:36 vmhost1 crmd: [3856]: info: send_direct_ack: ACK'ing resource op p_drbd_mount2:0_notify_0 from 168:2351:0:7f6d66f7-cfe5-4820-8289-0e47d8c9102b: lrm_invoke-lrmd-1339706556-56
Jun 14 15:42:36 vmhost1 crmd: [3856]: info: process_lrm_event: LRM operation p_drbd_mount2:0_notify_0 (call=59, rc=0, cib-update=0, confirmed=true) ok
Jun 14 15:42:36 vmhost1 crmd: [3856]: info: send_direct_ack: ACK'ing resource op p_drbd_mount1:0_notify_0 from 184:2351:0:7f6d66f7-cfe5-4820-8289-0e47d8c9102b: lrm_invoke-lrmd-1339706556-57
Jun 14 15:42:36 vmhost1 crmd: [3856]: info: process_lrm_event: LRM operation p_drbd_mount1:0_notify_0 (call=60, rc=0, cib-update=0, confirmed=true) ok
Jun 14 15:42:36 vmhost1 lrmd: [3853]: info: operation notify[61] on p_drbd_vmstore:0 for client 3856: pid 18179 exited with return code 0
Jun 14 15:42:36 vmhost1 crmd: [3856]: info: send_direct_ack: ACK'ing resource op p_drbd_vmstore:0_notify_0 from 200:2351:0:7f6d66f7-cfe5-4820-8289-0e47d8c9102b: lrm_invoke-lrmd-1339706556-58
Jun 14 15:42:36 vmhost1 crmd: [3856]: info: process_lrm_event: LRM operation p_drbd_vmstore:0_notify_0 (call=61, rc=0, cib-update=0, confirmed=true) ok
Jun 14 15:42:36 vmhost1 VirtualDomain[18174]: [18244]: INFO: Issuing graceful shutdown request for domain myvm.
Jun 14 15:42:36 vmhost1 lrmd: [3853]: info: RA output: (p_vm_myvm:stop:stdout) Domain myvm is being shutdown
Jun 14 15:43:32 vmhost1 attrd: [3855]: notice: attrd_trigger_update: Sending flush op to all hosts for: p_ping (2000)
Jun 14 15:43:32 vmhost1 attrd: [3855]: notice: attrd_perform_update: Sent update 165: p_ping=2000
Jun 14 15:43:58 vmhost1 lrmd: [3853]: info: RA output: (p_vm_myvm:stop:stderr) error: operation failed: could not query memory balloon allocation
Jun 14 15:43:58 vmhost1 VirtualDomain[18174]: [18899]: INFO: Virtual domain myvm currently has no state, retrying.
Jun 14 15:43:59 vmhost1 lrmd: [3853]: info: operation stop[58] on p_vm_myvm for client 3856: pid 18174 exited with return code 0
Jun 14 15:43:59 vmhost1 crmd: [3856]: info: process_lrm_event: LRM operation p_vm_myvm_stop_0 (call=58, rc=0, cib-update=83, confirmed=true) ok
Jun 14 15:44:01 vmhost1 crmd: [3856]: info: do_lrm_rsc_op: Performing key=31:2352:8:7f6d66f7-cfe5-4820-8289-0e47d8c9102b op=p_drbd_mount2:0_monitor_10000 )
Jun 14 15:44:01 vmhost1 lrmd: [3853]: info: rsc:p_drbd_mount2:0 monitor[62] (pid 18935)
Jun 14 15:44:01 vmhost1 crmd: [3856]: info: do_lrm_rsc_op: Performing key=64:2352:8:7f6d66f7-cfe5-4820-8289-0e47d8c9102b op=p_drbd_mount1:0_monitor_10000 )
Jun 14 15:44:01 vmhost1 lrmd: [3853]: info: rsc:p_drbd_mount1:0 monitor[63] (pid 18936)
Jun 14 15:44:01 vmhost1 crmd: [3856]: info: do_lrm_rsc_op: Performing key=97:2352:8:7f6d66f7-cfe5-4820-8289-0e47d8c9102b op=p_drbd_vmstore:0_monitor_10000 )
Jun 14 15:44:01 vmhost1 lrmd: [3853]: info: rsc:p_drbd_vmstore:0 monitor[64] (pid 18937)
Jun 14 15:44:01 vmhost1 crmd: [3856]: info: do_lrm_rsc_op: Performing key=128:2352:0:7f6d66f7-cfe5-4820-8289-0e47d8c9102b op=p_vm_myvm_start_0 )
Jun 14 15:44:01 vmhost1 lrmd: [3853]: info: rsc:p_vm_myvm start[65] (pid 18939)
Jun 14 15:44:01 vmhost1 lrmd: [3853]: info: operation monitor[62] on p_drbd_mount2:0 for client 3856: pid 18935 exited with return code 8
Jun 14 15:44:01 vmhost1 lrmd: [3853]: info: operation monitor[63] on p_drbd_mount1:0 for client 3856: pid 18936 exited with return code 8
Jun 14 15:44:01 vmhost1 crmd: [3856]: info: process_lrm_event: LRM operation p_drbd_mount2:0_monitor_10000 (call=62, rc=8, cib-update=84, confirmed=false) master
Jun 14 15:44:01 vmhost1 lrmd: [3853]: info: operation monitor[64] on p_drbd_vmstore:0 for client 3856: pid 18937 exited with return code 8
Jun 14 15:44:01 vmhost1 crmd: [3856]: info: process_lrm_event: LRM operation p_drbd_mount1:0_monitor_10000 (call=63, rc=8, cib-update=85, confirmed=false) master
Jun 14 15:44:01 vmhost1 crmd: [3856]: info: process_lrm_event: LRM operation p_drbd_vmstore:0_monitor_10000 (call=64, rc=8, cib-update=86, confirmed=false) master
Jun 14 15:44:01 vmhost1 VirtualDomain[18939]: [19033]: INFO: Domain name "myvm" saved to /var/run/resource-agents/VirtualDomain-p_vm_myvm.state.
Jun 14 15:44:02 vmhost1 lrmd: [3853]: info: RA output: (p_vm_myvm:start:stdout) Domain myvm started
Jun 14 15:44:02 vmhost1 lrmd: [3853]: info: operation start[65] on p_vm_myvm for client 3856: pid 18939 exited with return code 0
Jun 14 15:44:02 vmhost1 crmd: [3856]: info: process_lrm_event: LRM operation p_vm_myvm_start_0 (call=65, rc=0, cib-update=87, confirmed=true) ok
Jun 14 15:44:03 vmhost1 crmd: [3856]: info: do_lrm_rsc_op: Performing key=129:2352:0:7f6d66f7-cfe5-4820-8289-0e47d8c9102b op=p_vm_myvm_monitor_10000 )
Jun 14 15:44:03 vmhost1 lrmd: [3853]: info: rsc:p_vm_myvm monitor[66] (pid 19121)
Jun 14 15:44:03 vmhost1 lrmd: [3853]: info: operation monitor[66] on p_vm_myvm for client 3856: pid 19121 exited with return code 0
Jun 14 15:44:03 vmhost1 crmd: [3856]: info: process_lrm_event: LRM operation p_vm_myvm_monitor_10000 (call=66, rc=0, cib-update=88, confirmed=false) ok
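From the op keys in the excerpt (e.g. key=136:2351:0:...), transition 2351 appears to be the one that stopped p_vm_myvm, but the decision itself was made by pengine on the DC, so the reasoning does not show up in this lrmd/crmd log. If it is useful, this is how I understand the transition could be replayed - a sketch, assuming the default pe-input directory /var/lib/pengine on these packages, with NNNN a placeholder for the file number that would have to be taken from the DC's logs:

# on the DC: find which pe-input file transition 2351 was derived from
grep 'pe-input' /var/log/daemon.log | grep '2351'
# replay that transition, showing allocation scores and the planned actions
crm_simulate -x /var/lib/pengine/pe-input-NNNN.bz2 -S -s

One thing I do notice: attrd flushed p_ping=1000 at 15:42:35, immediately before the monitors were cancelled, and the value only returned to 2000 (both ping targets reachable) at 15:43:32. Given the loc_run_on_most_connected rule I wonder whether that connectivity dip is related, but I cannot confirm it from this log alone.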
Thanks,
Andrew