Hi David,

Thanks for your reply. Just to clear things up:
If everything is running on node-1 and I run "crm node standby node-1", everything moves to node-2. When I run "crm node online node-1", everything is perfectly fine and nothing gets disrupted on node-2. The services remain on node-2 until I move them manually, which is awesome.

I think you are on the right track, because this only happens after a reboot. If I shut down the pacemaker and corosync services on node-1, everything fails over to node-2. When I start the services back up on node-1, nothing gets interrupted on node-2; node-1 just comes online. I think this does in fact have something to do with the reboot (it's just not so graceful).

The reason I am testing by hard rebooting the entire server is that I want to test the behavior of pacemaker/drbd/corosync in the event of a power failure, a frozen system, or a kernel panic (I felt a hard reboot was a good way to test all three).

###
### Here is the Corosync log from node-2 right after I hard reset node-1: (Scroll down for the log when node-1 comes back up)
###

Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: determine_online_status: Node node-2.mycompany.com is online
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: group_print: Resource Group: jira_services
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: native_print: drbd1_opt-atlassian (ocf::heartbeat:Filesystem): Stopped
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: native_print: drbd2_var-atlassian (ocf::heartbeat:Filesystem): Stopped
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: native_print: failover-ip (ocf::heartbeat:IPaddr2): Stopped
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: native_print: atlassian_jira (lsb:jira): Stopped
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: clone_print: Master/Slave Set: ms_drbd_data [drbd_data]
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: short_print: Slaves: [ node-2.mycompany.com ]
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: short_print: Stopped: [ node-1.mycompany.com ]
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: native_color: Resource drbd_data:1 cannot run anywhere
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: master_color: Promoting drbd_data:0 (Slave node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: master_color: ms_drbd_data: Promoted 1 instances of a possible 1 to master
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: RecurringOp: Start recurring monitor (29s) for drbd_data:0 on node-2.mycompany.com
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: RecurringOp: Cancelling action drbd_data:0_monitor_31000 (Slave vs. Master)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: RecurringOp: Start recurring monitor (29s) for drbd_data:0 on node-2.mycompany.com
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: RecurringOp: Cancelling action drbd_data:0_monitor_31000 (Slave vs. Master)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: LogActions: Start drbd1_opt-atlassian (node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: LogActions: Start drbd2_var-atlassian (node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: LogActions: Start failover-ip (node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: LogActions: Start atlassian_jira (node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: LogActions: Promote drbd_data:0 (Slave -> Master node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: LogActions: Leave drbd_data:1 (Stopped)
Jan 21 07:05:12 [1963] node-2.mycompany.com crmd: info: do_te_invoke: Processing graph 18 (ref=pe_calc-dc-1390305912-111) derived from /var/lib/pacemaker/pengine/pe-input-835.bz2
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: process_pe_message: Calculated Transition 18: /var/lib/pacemaker/pengine/pe-input-835.bz2
Jan 21 07:05:12 [1963] node-2.mycompany.com crmd: notice: run_graph: Transition 18 (Complete=3, Pending=0, Fired=0, Skipped=10, Incomplete=4, Source=/var/lib/pacemaker/pengine/pe-input-835.bz2): Stopped
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: determine_online_status: Node node-2.mycompany.com is online
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: group_print: Resource Group: jira_services
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: native_print: drbd1_opt-atlassian (ocf::heartbeat:Filesystem): Stopped
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: native_print: drbd2_var-atlassian (ocf::heartbeat:Filesystem): Stopped
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: native_print: failover-ip (ocf::heartbeat:IPaddr2): Stopped
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: native_print: atlassian_jira (lsb:jira): Stopped
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: clone_print: Master/Slave Set: ms_drbd_data [drbd_data]
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: short_print: Slaves: [ node-2.mycompany.com ]
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: short_print: Stopped: [ node-1.mycompany.com ]
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: native_color: Resource drbd_data:1 cannot run anywhere
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: master_color: Promoting drbd_data:0 (Slave node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: master_color: ms_drbd_data: Promoted 1 instances of a possible 1 to master
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: RecurringOp: Start recurring monitor (29s) for drbd_data:0 on node-2.mycompany.com
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: RecurringOp: Start recurring monitor (29s) for drbd_data:0 on node-2.mycompany.com
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: LogActions: Start drbd1_opt-atlassian (node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: LogActions: Start drbd2_var-atlassian (node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: LogActions: Start failover-ip (node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: LogActions: Start atlassian_jira (node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: LogActions: Promote drbd_data:0 (Slave -> Master node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: LogActions: Leave drbd_data:1 (Stopped)
Jan 21 07:05:12 [1963] node-2.mycompany.com crmd: info: do_te_invoke: Processing graph 19 (ref=pe_calc-dc-1390305912-116) derived from /var/lib/pacemaker/pengine/pe-input-836.bz2
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: process_pe_message: Calculated Transition 19: /var/lib/pacemaker/pengine/pe-input-836.bz2
Jan 21 07:05:14 [1963] node-2.mycompany.com crmd: notice: run_graph: Transition 19 (Complete=9, Pending=0, Fired=0, Skipped=7, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-836.bz2): Stopped
Jan 21 07:05:14 [1962] node-2.mycompany.com pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Jan 21 07:05:14 [1962] node-2.mycompany.com pengine: info: determine_online_status: Node node-2.mycompany.com is online
Jan 21 07:05:14 [1962] node-2.mycompany.com pengine: info: group_print: Resource Group: jira_services
Jan 21 07:05:14 [1962] node-2.mycompany.com pengine: info: native_print: drbd1_opt-atlassian (ocf::heartbeat:Filesystem): Stopped

####
#### When node-1 has recovered:
####

Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: determine_online_status: Node node-2.mycompany.com is online
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: group_print: Resource Group: jira_services
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: native_print: drbd1_opt-atlassian (ocf::heartbeat:Filesystem): Stopped
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: native_print: drbd2_var-atlassian (ocf::heartbeat:Filesystem): Stopped
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: native_print: failover-ip (ocf::heartbeat:IPaddr2): Stopped
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: native_print: atlassian_jira (lsb:jira): Stopped
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: clone_print: Master/Slave Set: ms_drbd_data [drbd_data]
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: short_print: Slaves: [ node-2.mycompany.com ]
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: short_print: Stopped: [ node-1.mycompany.com ]
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: native_color: Resource drbd_data:1 cannot run anywhere
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: master_color: Promoting drbd_data:0 (Slave node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: master_color: ms_drbd_data: Promoted 1 instances of a possible 1 to master
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: RecurringOp: Start recurring monitor (29s) for drbd_data:0 on node-2.mycompany.com
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: RecurringOp: Cancelling action drbd_data:0_monitor_31000 (Slave vs. Master)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: RecurringOp: Start recurring monitor (29s) for drbd_data:0 on node-2.mycompany.com
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: RecurringOp: Cancelling action drbd_data:0_monitor_31000 (Slave vs. Master)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: LogActions: Start drbd1_opt-atlassian (node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: LogActions: Start drbd2_var-atlassian (node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: LogActions: Start failover-ip (node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: LogActions: Start atlassian_jira (node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: LogActions: Promote drbd_data:0 (Slave -> Master node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: LogActions: Leave drbd_data:1 (Stopped)
Jan 21 07:05:12 [1963] node-2.mycompany.com crmd: info: do_te_invoke: Processing graph 18 (ref=pe_calc-dc-1390305912-111) derived from /var/lib/pacemaker/pengine/pe-input-835.bz2
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: process_pe_message: Calculated Transition 18: /var/lib/pacemaker/pengine/pe-input-835.bz2
Jan 21 07:05:12 [1963] node-2.mycompany.com crmd: notice: run_graph: Transition 18 (Complete=3, Pending=0, Fired=0, Skipped=10, Incomplete=4, Source=/var/lib/pacemaker/pengine/pe-input-835.bz2): Stopped
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: determine_online_status: Node node-2.mycompany.com is online
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: group_print: Resource Group: jira_services
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: native_print: drbd1_opt-atlassian (ocf::heartbeat:Filesystem): Stopped
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: native_print: drbd2_var-atlassian (ocf::heartbeat:Filesystem): Stopped
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: native_print: failover-ip (ocf::heartbeat:IPaddr2): Stopped
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: native_print: atlassian_jira (lsb:jira): Stopped
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: clone_print: Master/Slave Set: ms_drbd_data [drbd_data]
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: short_print: Slaves: [ node-2.mycompany.com ]
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: short_print: Stopped: [ node-1.mycompany.com ]
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: native_color: Resource drbd_data:1 cannot run anywhere
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: master_color: Promoting drbd_data:0 (Slave node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: master_color: ms_drbd_data: Promoted 1 instances of a possible 1 to master
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: RecurringOp: Start recurring monitor (29s) for drbd_data:0 on node-2.mycompany.com
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: RecurringOp: Start recurring monitor (29s) for drbd_data:0 on node-2.mycompany.com
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: LogActions: Start drbd1_opt-atlassian (node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: LogActions: Start drbd2_var-atlassian (node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: LogActions: Start failover-ip (node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: LogActions: Start atlassian_jira (node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: LogActions: Promote drbd_data:0 (Slave -> Master node-2.mycompany.com)
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: info: LogActions: Leave drbd_data:1 (Stopped)
Jan 21 07:05:12 [1963] node-2.mycompany.com crmd: info: do_te_invoke: Processing graph 19 (ref=pe_calc-dc-1390305912-116) derived from /var/lib/pacemaker/pengine/pe-input-836.bz2
Jan 21 07:05:12 [1962] node-2.mycompany.com pengine: notice: process_pe_message: Calculated Transition 19: /var/lib/pacemaker/pengine/pe-input-836.bz2
Jan 21 07:05:14 [1963] node-2.mycompany.com crmd: notice: run_graph: Transition 19 (Complete=9, Pending=0, Fired=0, Skipped=7, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-836.bz2): Stopped
Jan 21 07:05:14 [1962] node-2.mycompany.com pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Jan 21 07:05:14 [1962] node-2.mycompany.com pengine: info: determine_online_status: Node node-2.mycompany.com is online
Jan 21 07:05:14 [1962] node-2.mycompany.com pengine: info: group_print: Resource Group: jira_services
Jan 21 07:05:14 [1962] node-2.mycompany.com pengine: info: native_print: drbd1_opt-atlassian (ocf::heartbeat:Filesystem): Stopped
Jan 21 07:05:14 [1962] node-2.mycompany.com pengine: info: native_print: drbd2_var-atlassian (ocf::heartbeat:Filesystem): Stopped
Jan 21 07:05:14 [1962] node-2.mycompany.com pengine: info: native_print: failover-ip (ocf::heartbeat:IPaddr2): Stopped
Jan 21 07:05:14 [1962] node-2.mycompany.com pengine: info: native_print: atlassian_jira (lsb:jira): Stopped
Jan 21 07:05:14 [1962] node-2.mycompany.com pengine: info: clone_print: Master/Slave Set: ms_drbd_data [drbd_data]
Jan 21 07:05:14 [1962] node-2.mycompany.com pengine: info: short_print: Masters: [ node-2.mycompany.com ]
Jan 21 07:05:14 [1962] node-2.mycompany.com pengine: info: short_print: Stopped: [ node-1.mycompany.com ]
Jan 21 07:05:14 [1962] node-2.mycompany.com pengine: info: native_color: Resource drbd_data:1 cannot run anywhere
Jan 21 07:05:14 [1962] node-2.mycompany.com pengine: info: master_color: Promoting drbd_data:0 (Master node-2.mycompany.com)
Jan 21 07:05:14 [1962] node-2.mycompany.com pengine: info: master_color: ms_drbd_data: Promoted 1 instances of a possible 1 to master
Jan 21 07:05:14 [1962] node-2.mycompany.com pengine: info: RecurringOp: Start recurring monitor (29s) for drbd_data:0 on node-2.mycompany.com
Jan 21 07:05:14 [1962] node-2.mycompany.com pengine: info: RecurringOp: Start recurring monitor (29s) for drbd_data:0 on node-2.mycompany.com
Jan 21 07:05:14 [1962] node-2.mycompany.com pengine: notice: LogActions: Start drbd1_opt-atlassian (node-2.mycompany.com)
Jan 21 07:05:14 [1962] node-2.mycompany.com pengine: notice: LogActions: Start drbd2_var-atlassian (node-2.mycompany.com)
Jan 21 07:05:14 [1962] node-2.mycompany.com pengine: notice: LogActions: Start failover-ip (node-2.mycompany.com)
Jan 21 07:05:14 [1962] node-2.mycompany.com pengine: notice: LogActions: Start atlassian_jira (node-2.mycompany.com)
Jan 21 07:05:14 [1962] node-2.mycompany.com pengine: info: LogActions: Leave drbd_data:0 (Master node-2.mycompany.com)
Jan 21 07:05:14 [1962] node-2.mycompany.com pengine: info: LogActions: Leave drbd_data:1 (Stopped)
Jan 21 07:05:14 [1963] node-2.mycompany.com crmd: info: do_te_invoke: Processing graph 20 (ref=pe_calc-dc-1390305914-122) derived from /var/lib/pacemaker/pengine/pe-input-837.bz2
Jan 21 07:05:14 [1962] node-2.mycompany.com pengine: notice: process_pe_message: Calculated Transition 20: /var/lib/pacemaker/pengine/pe-input-837.bz2
Jan 21 07:05:16 [1963] node-2.mycompany.com crmd: notice: run_graph: Transition 20 (Complete=7, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-837.bz2): Complete
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: determine_online_status: Node node-2.mycompany.com is online
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: determine_online_status: Node node-1.mycompany.com is online
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: group_print: Resource Group: jira_services
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: native_print: drbd1_opt-atlassian (ocf::heartbeat:Filesystem): Started node-2.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: native_print: drbd2_var-atlassian (ocf::heartbeat:Filesystem): Started node-2.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: native_print: failover-ip (ocf::heartbeat:IPaddr2): Started node-2.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: native_print: atlassian_jira (lsb:jira): Started node-2.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: clone_print: Master/Slave Set: ms_drbd_data [drbd_data]
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: short_print: Masters: [ node-2.mycompany.com ]
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: short_print: Stopped: [ node-1.mycompany.com ]
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: master_color: Promoting drbd_data:0 (Master node-2.mycompany.com)
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: master_color: ms_drbd_data: Promoted 1 instances of a possible 1 to master
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: RecurringOp: Start recurring monitor (31s) for drbd_data:1 on node-1.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: RecurringOp: Start recurring monitor (31s) for drbd_data:1 on node-1.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: LogActions: Leave drbd1_opt-atlassian (Started node-2.mycompany.com)
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: LogActions: Leave drbd2_var-atlassian (Started node-2.mycompany.com)
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: LogActions: Leave failover-ip (Started node-2.mycompany.com)
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: LogActions: Leave atlassian_jira (Started node-2.mycompany.com)
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: LogActions: Leave drbd_data:0 (Master node-2.mycompany.com)
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: notice: LogActions: Start drbd_data:1 (node-1.mycompany.com)
Jan 21 07:05:49 [1963] node-2.mycompany.com crmd: info: do_te_invoke: Processing graph 21 (ref=pe_calc-dc-1390305949-137) derived from /var/lib/pacemaker/pengine/pe-input-838.bz2
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: notice: process_pe_message: Calculated Transition 21: /var/lib/pacemaker/pengine/pe-input-838.bz2
Jan 21 07:05:49 [1963] node-2.mycompany.com crmd: notice: run_graph: Transition 21 (Complete=10, Pending=0, Fired=0, Skipped=3, Incomplete=5, Source=/var/lib/pacemaker/pengine/pe-input-838.bz2): Stopped
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: determine_online_status: Node node-2.mycompany.com is online
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: determine_online_status: Node node-1.mycompany.com is online
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: unpack_rsc_op: Operation monitor found resource drbd2_var-atlassian active on node-1.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: unpack_rsc_op: Operation monitor found resource drbd1_opt-atlassian active on node-1.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: group_print: Resource Group: jira_services
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: native_print: drbd1_opt-atlassian (ocf::heartbeat:Filesystem): Started
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: native_print: 1 : node-2.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: native_print: 2 : node-1.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: native_print: drbd2_var-atlassian (ocf::heartbeat:Filesystem): Started
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: native_print: 1 : node-2.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: native_print: 2 : node-1.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: native_print: failover-ip (ocf::heartbeat:IPaddr2): Started node-2.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: native_print: atlassian_jira (lsb:jira): Started node-2.mycompany.com
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: clone_print: Master/Slave Set: ms_drbd_data [drbd_data]
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: short_print: Masters: [ node-2.mycompany.com ]
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: short_print: Stopped: [ node-1.mycompany.com ]
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: master_color: Promoting drbd_data:0 (Master node-2.mycompany.com)
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: master_color: ms_drbd_data: Promoted 1 instances of a possible 1 to master
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: error: native_create_actions: Resource drbd1_opt-atlassian (ocf::Filesystem) is active on 2 nodes attempting recovery
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: warning: native_create_actions: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: error: native_create_actions: Resource drbd2_var-atlassian (ocf::Filesystem) is active on 2 nodes attempting recovery Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: warning: native_create_actions: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information. Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: RecurringOp: Start recurring monitor (31s) for drbd_data:1 on node-1.mycompany.com Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: RecurringOp: Start recurring monitor (31s) for drbd_data:1 on node-1.mycompany.com Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: notice: LogActions: Restart drbd1_opt-atlassian (Started node-2.mycompany.com) Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: notice: LogActions: Restart drbd2_var-atlassian (Started node-2.mycompany.com) Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: notice: LogActions: Restart failover-ip (Started node-2.mycompany.com) Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: notice: LogActions: Restart atlassian_jira (Started node-2.mycompany.com) Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: info: LogActions: Leave drbd_data:0 (Master node-2.mycompany.com) Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: notice: LogActions: Start drbd_data:1 (node-1.mycompany.com) Jan 21 07:05:49 [1963] node-2.mycompany.com crmd: info: do_te_invoke: Processing graph 22 (ref=pe_calc-dc-1390305949-146) derived from /var/lib/pacemaker/pengine/pe-error-168.bz2 Jan 21 07:05:49 [1962] node-2.mycompany.com pengine: error: process_pe_message: Calculated Transition 22: /var/lib/pacemaker/pengine/pe-error-168.bz2 Jan 21 07:06:11 [1963] node-2.mycompany.com crmd: notice: run_graph: Transition 22 (Complete=14, Pending=0, Fired=0, Skipped=13, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-error-168.bz2): Stopped Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: notice: unpack_config: On loss of CCM 
Quorum: Ignore Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: info: determine_online_status: Node node-2.mycompany.com is online Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: info: determine_online_status: Node node-1.mycompany.com is online Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: info: unpack_rsc_op: Operation monitor found resource drbd2_var-atlassian active on node-1.mycompany.com Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: info: unpack_rsc_op: Operation monitor found resource drbd1_opt-atlassian active on node-1.mycompany.com Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: info: group_print: Resource Group: jira_services Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: info: native_print: drbd1_opt-atlassian (ocf::heartbeat:Filesystem): Started Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: info: native_print: 1 : node-2.mycompany.com Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: info: native_print: 2 : node-1.mycompany.com Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: info: native_print: drbd2_var-atlassian (ocf::heartbeat:Filesystem): Started Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: info: native_print: 1 : node-2.mycompany.com Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: info: native_print: 2 : node-1.mycompany.com Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: info: native_print: failover-ip (ocf::heartbeat:IPaddr2): Started node-2.mycompany.com Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: info: native_print: atlassian_jira (lsb:jira): Stopped Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: info: clone_print: Master/Slave Set: ms_drbd_data [drbd_data] Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: info: short_print: Masters: [ node-2.mycompany.com ] Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: info: short_print: Slaves: [ node-1.mycompany.com ] Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: info: master_color: Promoting drbd_data:0 
(Master node-2.mycompany.com) Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: info: master_color: ms_drbd_data: Promoted 1 instances of a possible 1 to master Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: error: native_create_actions: Resource drbd1_opt-atlassian (ocf::Filesystem) is active on 2 nodes attempting recovery Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: warning: native_create_actions: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information. Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: error: native_create_actions: Resource drbd2_var-atlassian (ocf::Filesystem) is active on 2 nodes attempting recovery Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: warning: native_create_actions: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information. Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: notice: LogActions: Restart drbd1_opt-atlassian (Started node-2.mycompany.com) Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: notice: LogActions: Restart drbd2_var-atlassian (Started node-2.mycompany.com) Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: notice: LogActions: Restart failover-ip (Started node-2.mycompany.com) Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: notice: LogActions: Start atlassian_jira (node-2.mycompany.com) Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: info: LogActions: Leave drbd_data:0 (Master node-2.mycompany.com) Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: info: LogActions: Leave drbd_data:1 (Slave node-1.mycompany.com) Jan 21 07:06:11 [1963] node-2.mycompany.com crmd: info: do_te_invoke: Processing graph 23 (ref=pe_calc-dc-1390305971-156) derived from /var/lib/pacemaker/pengine/pe-error-169.bz2 Jan 21 07:06:11 [1962] node-2.mycompany.com pengine: error: process_pe_message: Calculated Transition 23: /var/lib/pacemaker/pengine/pe-error-169.bz2 Jan 21 07:06:15 [1963] node-2.mycompany.com crmd: notice: run_graph: Transition 23 
(Complete=14, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-error-169.bz2): Complete Jan 21 07:06:21 [1962] node-2.mycompany.com pengine: notice: unpack_config: On loss of CCM Quorum: Ignore Jan 21 07:06:21 [1962] node-2.mycompany.com pengine: info: determine_online_status: Node node-2.mycompany.com is online Jan 21 07:06:21 [1962] node-2.mycompany.com pengine: info: determine_online_status: Node node-1.mycompany.com is online Jan 21 07:06:21 [1962] node-2.mycompany.com pengine: info: unpack_rsc_op: Operation monitor found resource drbd2_var-atlassian active on node-1.mycompany.com Jan 21 07:06:21 [1962] node-2.mycompany.com pengine: info: unpack_rsc_op: Operation monitor found resource drbd1_opt-atlassian active on node-1.mycompany.com Jan 21 07:06:21 [1962] node-2.mycompany.com pengine: info: group_print: Resource Group: jira_services Jan 21 07:06:21 [1962] node-2.mycompany.com pengine: info: native_print: drbd1_opt-atlassian (ocf::heartbeat:Filesystem): Started node-2.mycompany.com Jan 21 07:06:21 [1962] node-2.mycompany.com pengine: info: native_print: drbd2_var-atlassian (ocf::heartbeat:Filesystem): Started node-2.mycompany.com Jan 21 07:06:21 [1962] node-2.mycompany.com pengine: info: native_print: failover-ip (ocf::heartbeat:IPaddr2): Started node-2.mycompany.com Jan 21 07:06:21 [1962] node-2.mycompany.com pengine: info: native_print: atlassian_jira (lsb:jira): Started node-2.mycompany.com Jan 21 07:06:21 [1962] node-2.mycompany.com pengine: info: clone_print: Master/Slave Set: ms_drbd_data [drbd_data] Jan 21 07:06:21 [1962] node-2.mycompany.com pengine: info: short_print: Masters: [ node-2.mycompany.com ] Jan 21 07:06:21 [1962] node-2.mycompany.com pengine: info: short_print: Slaves: [ node-1.mycompany.com ] Jan 21 07:06:21 [1962] node-2.mycompany.com pengine: info: master_color: Promoting drbd_data:0 (Master node-2.mycompany.com) Jan 21 07:06:21 [1962] node-2.mycompany.com pengine: info: master_color: ms_drbd_data: 
Promoted 1 instances of a possible 1 to master Jan 21 07:06:21 [1962] node-2.mycompany.com pengine: info: LogActions: Leave drbd1_opt-atlassian (Started node-2.mycompany.com) Jan 21 07:06:21 [1962] node-2.mycompany.com pengine: info: LogActions: Leave drbd2_var-atlassian (Started node-2.mycompany.com) Jan 21 07:06:21 [1962] node-2.mycompany.com pengine: info: LogActions: Leave failover-ip (Started node-2.mycompany.com) Jan 21 07:06:21 [1962] node-2.mycompany.com pengine: info: LogActions: Leave atlassian_jira (Started node-2.mycompany.com) Jan 21 07:06:21 [1962] node-2.mycompany.com pengine: info: LogActions: Leave drbd_data:0 (Master node-2.mycompany.com) Jan 21 07:06:21 [1962] node-2.mycompany.com pengine: info: LogActions: Leave drbd_data:1 (Slave node-1.mycompany.com) Jan 21 07:06:21 [1963] node-2.mycompany.com crmd: info: do_te_invoke: Processing graph 24 (ref=pe_calc-dc Thanks again, David. Mike. ----- Original Message ----- From: "David Vossel" <dvos...@redhat.com> To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org> Sent: Tuesday, January 21, 2014 10:26:45 AM Subject: Re: [Pacemaker] Preventing Automatic Failback ----- Original Message ----- > From: "Michael Monette" <mmone...@2keys.ca> > To: pacemaker@oss.clusterlabs.org > Sent: Monday, January 20, 2014 8:22:25 AM > Subject: [Pacemaker] Preventing Automatic Failback > > Hi, > > I posted this question before but my question was a bit unclear. > > I have 2 nodes with DRBD with Postgresql. > > When node-1 fails, everything fails to node-2 . But when node 1 is recovered, > things try to failback to node-1 and all the services running on node-2 get > disrupted(things don't ACTUALLY fail back to node-1..they try, fail, and > then all services on node-2 are simply restarted..very annoying). This does > not happen if I perform the same tests on node-2! I can reboot node-2, > things fail to node-1 and node-2 comes online and waits until he is > needed(this is what I want!) 
> It seems to only affect my node-1's.
>
> I have tried to set resource stickiness, I have tried everything I can really
> think of, but whenever the Primary has recovered, it will always disrupt
> services running on node-2.
>
> Also I tried removing things from this config to try and isolate this. At one
> point I removed the atlassian_jira and drbd2_var primitives and only had a
> failover-ip and drbd1_opt, but still had the same problem. Hopefully someone
> can pinpoint this out for me. If I can't really avoid this, I would at least
> like to make this "bug" or whatever happen on node-2 instead of the actives.

I bet this is due to the drbd resource's master score value on node1 being
higher than node2. When you recover node1, are you actually rebooting that
node? If node1 doesn't lose membership from the cluster (reboot), those
transient attributes that the drbd agent uses to specify which node will be
the master instance will stick around. Otherwise, if you are just putting
node1 in standby and then bringing the node back online, then I believe the
resources will come back if the drbd master was originally on node1.

If you provide a policy engine file that shows the unwanted transition from
node2 back to node1, we'll be able to tell you exactly why it is occurring.
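For anyone following along, here is a rough sketch of how such a file could be replayed and how the master scores could be inspected. It assumes the crm_simulate and crm_mon tools that ship with Pacemaker are installed on the node. The function names are made up for illustration, and the path in the usage comment is simply the one that appears in the node-2 log, so substitute whichever pe file actually covers the unwanted transition.

```shell
#!/bin/sh
# Hypothetical helper names; they only wrap standard Pacemaker CLI tools.

# Replay a saved policy engine input (-x FILE), simulate the resulting
# transition (-S) and print the allocation scores behind it (-s).
replay_pe_file() {
    crm_simulate -S -s -x "$1"
}

# Print the cluster status once (-1) including node attributes (-A).
# The drbd agent's master preference is kept as a transient node
# attribute, so it should be listed per node in this output.
show_node_attributes() {
    crm_mon -1 -A
}

# Example usage, run on the node that wrote the file:
#   replay_pe_file /var/lib/pacemaker/pengine/pe-error-169.bz2
#   show_node_attributes
```

Comparing the scores printed for node-1 and node-2 should show whether the master preference on node-1 is what pulls the promotion back.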
-- Vossel

>
> Here is my config:
>
> node node-1.comp.com \
>         attributes standby="off"
> node node-1.comp.com \
>         attributes standby="off"
> primitive atlassian_jira lsb:jira \
>         op start interval="0" timeout="240" \
>         op stop interval="0" timeout="240"
> primitive drbd1_opt ocf:heartbeat:Filesystem \
>         params device="/dev/drbd1" directory="/opt/atlassian" fstype="ext4"
> primitive drbd2_var ocf:heartbeat:Filesystem \
>         params device="/dev/drbd2" directory="/var/atlassian" fstype="ext4"
> primitive drbd_data ocf:linbit:drbd \
>         params drbd_resource="r0" \
>         op monitor interval="29s" role="Master" \
>         op monitor interval="31s" role="Slave"
> primitive failover-ip ocf:heartbeat:IPaddr2 \
>         params ip="10.199.0.13"
> group jira_services drbd1_opt drbd2_var failover-ip atlassian_jira
> ms ms_drbd_data drbd_data \
>         meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true"
> colocation jira_services_on_drbd inf: atlassian_jira ms_drbd_data:Master
> order jira_services_after_drbd inf: ms_drbd_data:promote jira_services:start
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.10-14.el6_5.1-368c726" \
>         cluster-infrastructure="classic openais (with plugin)" \
>         expected-quorum-votes="2" \
>         stonith-enabled="false" \
>         no-quorum-policy="ignore" \
>         last-lrm-refresh="1390183165" \
>         default-resource-stickiness="INFINITY"
> rsc_defaults $id="rsc-options" \
>         resource-stickiness="INFINITY"
>
> Thanks
>
> Mike
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>