Hi everyone,

I experienced a weird issue last night where our cluster tried to fail over due to an "Unknown interface" error.
It looks like when the IPaddr2 monitor tried to check the status of eth0, it didn't find the device. Both nodes are VMs. I haven't found any reason why eth0 would have "disappeared".

<LOG NODE1>
Sep 29 21:25:04 node-01 IPaddr2(vip_v207_174)[4082]: ERROR: Unknown interface [eth0] No such device.
Sep 29 21:25:04 node-01 IPaddr2(vip_v207_174)[4082]: ERROR: [findif] failed
Sep 29 21:25:05 node-01 crmd[3369]: notice: process_lrm_event: Operation vip_v207_174_monitor_10000: not configured (node=node-01, call=91, rc=6, cib-update=73, confirmed=false)
Sep 29 21:25:06 node-01 attrd[3367]: notice: attrd_cs_dispatch: Update relayed from node-02
Sep 29 21:25:06 node-01 attrd[3367]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-vip_v207_174 (2)
Sep 29 21:25:06 node-01 attrd[3367]: notice: attrd_perform_update: Sent update 41: fail-count-vip_v207_174=2
Sep 29 21:25:06 node-01 attrd[3367]: notice: attrd_cs_dispatch: Update relayed from node-02
Sep 29 21:25:06 node-01 attrd[3367]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-vip_v207_174 (1443576306)
Sep 29 21:25:06 node-01 attrd[3367]: notice: attrd_perform_update: Sent update 43: last-failure-vip_v207_174=1443576306
Sep 29 21:25:07 node-01 crmd[3369]: notice: process_lrm_event: Operation fwcorp-mailto-sysadmin_stop_0: ok (node=node-01, call=110, rc=0, cib-update=74, confirmed=true)
Sep 29 21:25:07 node-01 crmd[3369]: notice: process_lrm_event: Operation change-default-fw_stop_0: ok (node=node-01, call=112, rc=0, cib-update=75, confirmed=true)
Sep 29 21:25:07 node-01 IPaddr2(vip_v254_230)[4259]: INFO: IP status = ok, IP_CIP=
Sep 29 21:25:07 node-01 crmd[3369]: notice: process_lrm_event: Operation vip_v254_230_stop_0: ok (node=node-01, call=114, rc=0, cib-update=76, confirmed=true)
Sep 29 21:25:07 node-01 IPaddr2(vip_v27_1)[4313]: INFO: IP status = ok, IP_CIP=
Sep 29 21:25:07 node-01 crmd[3369]: notice: process_lrm_event: Operation vip_v27_1_stop_0: ok (node=node-01, call=116, rc=0, cib-update=77, confirmed=true)
Sep 29 21:25:07 node-01 IPaddr2(vip_v26_1)[4366]: INFO: IP status = ok, IP_CIP=
Sep 29 21:25:07 node-01 crmd[3369]: notice: process_lrm_event: Operation vip_v26_1_stop_0: ok (node=node-01, call=118, rc=0, cib-update=78, confirmed=true)
Sep 29 21:25:07 node-01 IPaddr2(vip_v207_174)[4419]: INFO: IP status = ok, IP_CIP=
Sep 29 21:25:07 node-01 crmd[3369]: notice: process_lrm_event: Operation vip_v207_174_stop_0: ok (node=node-01, call=120, rc=0, cib-update=79, confirmed=true)
</LOG NODE1>

<LOG NODE2>
Sep 29 21:22:48 node-02 crmd[3241]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Sep 29 21:22:48 node-02 pengine[3240]: notice: update_validation: pacemaker-1.2-style configuration is also valid for pacemaker-1.3
Sep 29 21:22:48 node-02 pengine[3240]: notice: update_validation: Upgrading pacemaker-1.3-style configuration to pacemaker-2.0 with upgrade-1.3.xsl
Sep 29 21:22:48 node-02 pengine[3240]: notice: update_validation: Transformed the configuration from pacemaker-1.2 to pacemaker-2.0
Sep 29 21:22:48 node-02 pengine[3240]: notice: unpack_config: On loss of CCM Quorum: Ignore
Sep 29 21:22:48 node-02 crmd[3241]: notice: run_graph: Transition 14769 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-786.bz2): Complete
Sep 29 21:22:48 node-02 crmd[3241]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Sep 29 21:22:48 node-02 pengine[3240]: notice: process_pe_message: Calculated Transition 14769: /var/lib/pacemaker/pengine/pe-input-786.bz2
Sep 29 21:25:06 node-02 crmd[3241]: warning: update_failcount: Updating failcount for vip_v207_174 on node-01 after failed monitor: rc=6 (update=value++, time=1443576306)
Sep 29 21:25:06 node-02 crmd[3241]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Sep 29 21:25:06 node-02 pengine[3240]: notice: update_validation: pacemaker-1.2-style configuration is also valid for pacemaker-1.3
Sep 29 21:25:06 node-02 pengine[3240]: notice: update_validation: Upgrading pacemaker-1.3-style configuration to pacemaker-2.0 with upgrade-1.3.xsl
Sep 29 21:25:06 node-02 pengine[3240]: notice: update_validation: Transformed the configuration from pacemaker-1.2 to pacemaker-2.0
Sep 29 21:25:06 node-02 pengine[3240]: notice: unpack_config: On loss of CCM Quorum: Ignore
Sep 29 21:25:06 node-02 pengine[3240]: warning: unpack_rsc_op_failure: Processing failed op monitor for vip_v207_174 on node-01: not configured (6)
Sep 29 21:25:06 node-02 pengine[3240]: error: unpack_rsc_op: Preventing vip_v207_174 from re-starting anywhere: operation monitor failed 'not configured' (6)
Sep 29 21:25:06 node-02 pengine[3240]: notice: LogActions: Stop vip_v207_174#011(node-01)
Sep 29 21:25:06 node-02 pengine[3240]: notice: LogActions: Stop vip_v26_1#011(node-01)
Sep 29 21:25:06 node-02 pengine[3240]: notice: LogActions: Stop vip_v27_1#011(node-01)
Sep 29 21:25:06 node-02 pengine[3240]: notice: LogActions: Stop vip_v254_230#011(node-01)
Sep 29 21:25:06 node-02 pengine[3240]: notice: LogActions: Stop change-default-fw#011(node-01)
Sep 29 21:25:06 node-02 pengine[3240]: notice: LogActions: Stop fwcorp-mailto-sysadmin#011(node-01)
Sep 29 21:25:06 node-02 pengine[3240]: notice: process_pe_message: Calculated Transition 14770: /var/lib/pacemaker/pengine/pe-input-787.bz2
Sep 29 21:25:06 node-02 crmd[3241]: notice: te_rsc_command: Initiating action 16: stop fwcorp-mailto-sysadmin_stop_0 on node-01
Sep 29 21:25:06 node-02 crmd[3241]: notice: abort_transition_graph: Transition aborted by status-node-01-fail-count-vip_v207_174, fail-count-vip_v207_174=2: Transient attribute change (modify cib=0.94.107, source=te_update_diff:391, path=/cib/status/node_state[@id='node-01']/transient_attributes[@id='node-01']/instance_attributes[@id='status-node-01']/nvpair[@id='status-node-01-fail-count-vip_v207_174'], 0)
Sep 29 21:25:07 node-02 crmd[3241]: notice: run_graph: Transition 14770 (Complete=2, Pending=0, Fired=0, Skipped=7, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-787.bz2): Stopped
Sep 29 21:25:07 node-02 pengine[3240]: notice: update_validation: pacemaker-1.2-style configuration is also valid for pacemaker-1.3
Sep 29 21:25:07 node-02 pengine[3240]: notice: update_validation: Upgrading pacemaker-1.3-style configuration to pacemaker-2.0 with upgrade-1.3.xsl
Sep 29 21:25:07 node-02 pengine[3240]: notice: update_validation: Transformed the configuration from pacemaker-1.2 to pacemaker-2.0
Sep 29 21:25:07 node-02 pengine[3240]: notice: unpack_config: On loss of CCM Quorum: Ignore
Sep 29 21:25:07 node-02 pengine[3240]: warning: unpack_rsc_op_failure: Processing failed op monitor for vip_v207_174 on node-01: not configured (6)
Sep 29 21:25:07 node-02 pengine[3240]: error: unpack_rsc_op: Preventing vip_v207_174 from re-starting anywhere: operation monitor failed 'not configured' (6)
Sep 29 21:25:07 node-02 pengine[3240]: notice: LogActions: Stop vip_v207_174#011(node-01)
Sep 29 21:25:07 node-02 pengine[3240]: notice: LogActions: Stop vip_v26_1#011(node-01)
Sep 29 21:25:07 node-02 pengine[3240]: notice: LogActions: Stop vip_v27_1#011(node-01)
Sep 29 21:25:07 node-02 pengine[3240]: notice: LogActions: Stop vip_v254_230#011(node-01)
Sep 29 21:25:07 node-02 pengine[3240]: notice: LogActions: Stop change-default-fw#011(node-01)
Sep 29 21:25:07 node-02 crmd[3241]: notice: te_rsc_command: Initiating action 14: stop change-default-fw_stop_0 on node-01
Sep 29 21:25:07 node-02 pengine[3240]: notice: process_pe_message: Calculated Transition 14771: /var/lib/pacemaker/pengine/pe-input-788.bz2
Sep 29 21:25:07 node-02 crmd[3241]: notice: te_rsc_command: Initiating action 13: stop vip_v254_230_stop_0 on node-01
Sep 29 21:25:07 node-02 crmd[3241]: notice: te_rsc_command: Initiating action 12: stop vip_v27_1_stop_0 on node-01
Sep 29 21:25:07 node-02 crmd[3241]: notice: te_rsc_command: Initiating action 11: stop vip_v26_1_stop_0 on node-01
Sep 29 21:25:07 node-02 crmd[3241]: notice: te_rsc_command: Initiating action 3: stop vip_v207_174_stop_0 on node-01
Sep 29 21:25:07 node-02 crmd[3241]: notice: run_graph: Transition 14771 (Complete=8, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-788.bz2): Complete
Sep 29 21:25:07 node-02 crmd[3241]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
</LOG NODE2>

I have found some posts that recommend running

  sysctl -w net.ipv4.conf.all.promote_secondaries=1

to keep secondary addresses from being removed when the primary address goes away, but in this case eth0 carries a single address, which is managed through IPaddr2 in the crm configuration.

Here's the configuration of the nodes:

<CONFIGURATION>
Cluster Name: nodecluster1
Corosync Nodes:
 node-01 node-02
Pacemaker Nodes:
 node-01 node-02

Resources:
 Group: lbpcivip
  Resource: vip_v207_174 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=x.x.x.174 cidr_netmask=27 broadcast=x.x.x.191 nic=eth0
   Operations: monitor interval=10s (vip_v207_174-monitor-interval-10s)
  Resource: vip_v26_1 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=x.x.26.1
   Operations: monitor interval=10s (vip_v26_1-monitor-interval-10s)
  Resource: vip_v27_1 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=x.x.27.1
   Operations: monitor interval=10s (vip_v27_1-monitor-interval-10s)
  Resource: vip_v254_230 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=x.x.254.230
   Operations: monitor interval=10s (vip_v254_230-monitor-interval-10s)
  Resource: change-default-fw (class=lsb type=fwdefaultgw)
   Operations: monitor interval=60s (change-default-fw-monitor-interval-60s)
  Resource: fwcorp-mailto-sysadmin (class=ocf provider=heartbeat type=MailTo)
   Attributes: email=i...@touchtunes.com subject="[node - Clustered services]"
   Operations: monitor interval=60s (fwcorp-mailto-sysadmin-monitor-interval-60s)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
Colocation Constraints:

Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.11-97629de
 last-lrm-refresh: 1412269491
 no-quorum-policy: ignore
 stonith-enabled: false
</CONFIGURATION>

Does anyone have a suggestion on how I can solve this issue? Why didn't the failover from node1 to node2 work?

If more information is required, let me know. Any suggestion would be appreciated!

Thanks!

--
                    !!!!!
                   ( o o )
--------------oOO----(_)----OOo--------------
Luc Paulin
email: paulinster(at)gmail.com
Skype: paulinster
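
In case it helps anyone reading the logs above, here is a minimal Python sketch that pulls the non-zero return codes out of crmd "process_lrm_event" lines and labels them with the standard OCF resource agent exit-code names (rc=6 is OCF_ERR_CONFIGURED, which the log prints as "not configured"). The names failed_operations, OCF_RETURN_CODES and LRM_EVENT are made up for this illustration, and the regular expression only assumes the line layout seen in the excerpts above; other Pacemaker versions may log differently.

```python
import re

# OCF resource agent exit codes 0-7, using the names from the OCF
# resource agent API. Codes outside this table are reported as "unknown".
OCF_RETURN_CODES = {
    0: "OCF_SUCCESS",
    1: "OCF_ERR_GENERIC",
    2: "OCF_ERR_ARGS",
    3: "OCF_ERR_UNIMPLEMENTED",
    4: "OCF_ERR_PERM",
    5: "OCF_ERR_INSTALLED",
    6: "OCF_ERR_CONFIGURED",
    7: "OCF_NOT_RUNNING",
}

# Hypothetical pattern, written against the crmd lines quoted in this post.
LRM_EVENT = re.compile(
    r"process_lrm_event: Operation (?P<op>\S+): .*?"
    r"\(node=(?P<node>[^,]+), call=\d+, rc=(?P<rc>\d+)"
)


def failed_operations(log_lines):
    """Return (operation, node, rc, ocf_name) for each event with rc != 0."""
    failures = []
    for line in log_lines:
        match = LRM_EVENT.search(line)
        if match is None:
            continue  # not a process_lrm_event line
        rc = int(match.group("rc"))
        if rc != 0:
            name = OCF_RETURN_CODES.get(rc, "unknown")
            failures.append((match.group("op"), match.group("node"), rc, name))
    return failures


# Two lines copied from the node-01 log above.
sample = [
    "Sep 29 21:25:05 node-01 crmd[3369]: notice: process_lrm_event: "
    "Operation vip_v207_174_monitor_10000: not configured "
    "(node=node-01, call=91, rc=6, cib-update=73, confirmed=false)",
    "Sep 29 21:25:07 node-01 crmd[3369]: notice: process_lrm_event: "
    "Operation vip_v27_1_stop_0: ok "
    "(node=node-01, call=116, rc=0, cib-update=77, confirmed=true)",
]

for failure in failed_operations(sample):
    print(failure)
```

Run against the full node-01 excerpt, this should report only the vip_v207_174 monitor operation, since every stop operation in the excerpt returned rc=0.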
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org