This seems to happen with every resource agent we've been using -- including ClusterMon, anything, and others. The resource agents aren't actually being called when these errors appear, though the logs in /var/log/messages suggest they were. The operations reported vary -- always a monitor, but sometimes a stop and start as well. In every case, the resource agent never receives those calls.
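One way to confirm that the agent is never invoked (we did something similar with a one-line `date` append, mentioned below) is a tracing shim at the top of a copy of the agent. This is a minimal sketch; the log path, variable, and function names are illustrative, not from the actual agents:

```shell
#!/bin/sh
# Hypothetical tracing shim for the top of an OCF resource agent.
# Every real invocation appends a timestamped line, so operations that
# crmd/pengine report but never dispatch will be absent from the trace.
TRACE_LOG="${TRACE_LOG:-/tmp/log.rsc}"

trace_invocation() {
    # __OCF_ACTION is set by lrmd for real agent calls (monitor/start/stop).
    echo "$(date '+%F %T') action=${__OCF_ACTION:-unset} args: $*" >> "$TRACE_LOG"
}

trace_invocation "$@"
```

Comparing the timestamps in this trace against the crmd process_lrm_event entries in /var/log/messages shows which reported operations correspond to real agent calls.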
Below is a log from this happening. I'll annotate it to try to provide some context.

Aug 12 11:22:10 ti14-demo1 kernel: DLM (built Dec 9 2013 18:27:30) installed
Aug 12 11:22:14 ti14-demo1 corosync[2925]: [MAIN ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
Aug 12 11:22:14 ti14-demo1 corosync[2925]: [MAIN ] Corosync built-in features: nss dbus rdma snmp
Aug 12 11:22:14 ti14-demo1 corosync[2925]: [MAIN ] Successfully read config from /etc/cluster/cluster.conf
Aug 12 11:22:14 ti14-demo1 corosync[2925]: [MAIN ] Successfully parsed cman config
Aug 12 11:22:14 ti14-demo1 corosync[2925]: [TOTEM ] Initializing transport (UDP/IP Unicast).
Aug 12 11:22:14 ti14-demo1 corosync[2925]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Aug 12 11:22:14 ti14-demo1 corosync[2925]: [TOTEM ] The network interface [10.28.230.201] is now up.
Aug 12 11:22:14 ti14-demo1 corosync[2925]: [QUORUM] Using quorum provider quorum_cman
Aug 12 11:22:14 ti14-demo1 corosync[2925]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Aug 12 11:22:14 ti14-demo1 corosync[2925]: [CMAN ] CMAN 3.0.12.1 (built Apr 14 2014 09:36:10) started
Aug 12 11:22:14 ti14-demo1 corosync[2925]: [SERV ] Service engine loaded: corosync CMAN membership service 2.90
Aug 12 11:22:14 ti14-demo1 corosync[2925]: [SERV ] Service engine loaded: openais checkpoint service B.01.01
Aug 12 11:22:14 ti14-demo1 corosync[2925]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
Aug 12 11:22:14 ti14-demo1 corosync[2925]: [SERV ] Service engine loaded: corosync configuration service
Aug 12 11:22:14 ti14-demo1 corosync[2925]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
Aug 12 11:22:14 ti14-demo1 corosync[2925]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
Aug 12 11:22:14 ti14-demo1 corosync[2925]: [SERV ] Service engine loaded: corosync profile loading service
Aug 12 11:22:14 ti14-demo1 corosync[2925]: [QUORUM] Using quorum provider quorum_cman
Aug 12 11:22:14 ti14-demo1 corosync[2925]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Aug 12 11:22:14 ti14-demo1 corosync[2925]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
Aug 12 11:22:14 ti14-demo1 corosync[2925]: [TOTEM ] adding new UDPU member {10.28.230.201}
Aug 12 11:22:14 ti14-demo1 corosync[2925]: [TOTEM ] adding new UDPU member {10.28.230.203}
Aug 12 11:22:14 ti14-demo1 corosync[2925]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 12 11:22:14 ti14-demo1 corosync[2925]: [QUORUM] Members[1]: 1
Aug 12 11:22:14 ti14-demo1 corosync[2925]: [QUORUM] Members[1]: 1
Aug 12 11:22:14 ti14-demo1 corosync[2925]: [CPG ] chosen downlist: sender r(0) ip(10.28.230.201) ; members(old:0 left:0)
Aug 12 11:22:14 ti14-demo1 corosync[2925]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 12 11:22:18 ti14-demo1 fenced[2978]: fenced 3.0.12.1 started
Aug 12 11:22:18 ti14-demo1 dlm_controld[2997]: dlm_controld 3.0.12.1 started
Aug 12 11:22:19 ti14-demo1 corosync[2925]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 12 11:22:19 ti14-demo1 corosync[2925]: [CMAN ] quorum regained, resuming activity
Aug 12 11:22:19 ti14-demo1 corosync[2925]: [QUORUM] This node is within the primary component and will provide service.
Aug 12 11:22:19 ti14-demo1 corosync[2925]: [QUORUM] Members[2]: 1 2
Aug 12 11:22:19 ti14-demo1 corosync[2925]: [QUORUM] Members[2]: 1 2
Aug 12 11:22:19 ti14-demo1 corosync[2925]: [CPG ] chosen downlist: sender r(0) ip(10.28.230.201) ; members(old:1 left:0)
Aug 12 11:22:19 ti14-demo1 corosync[2925]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 12 11:22:19 ti14-demo1 gfs_controld[3053]: gfs_controld 3.0.12.1 started
Aug 12 11:22:20 ti14-demo1 pacemaker: Starting Pacemaker Cluster Manager
Aug 12 11:22:20 ti14-demo1 pacemakerd[3136]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Aug 12 11:22:20 ti14-demo1 pacemakerd[3136]: notice: main: Starting Pacemaker 1.1.10-14.el6_5.3 (Build: 368c726): generated-manpages agent-manpages ascii-docs publican-docs ncurses libqb-logging libqb-ipc nagios corosync-plugin cman
Aug 12 11:22:21 ti14-demo1 stonith-ng[3143]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Aug 12 11:22:21 ti14-demo1 lrmd[3144]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Aug 12 11:22:21 ti14-demo1 stonith-ng[3143]: notice: crm_cluster_connect: Connecting to cluster infrastructure: cman
Aug 12 11:22:21 ti14-demo1 attrd[3145]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Aug 12 11:22:21 ti14-demo1 attrd[3145]: notice: crm_cluster_connect: Connecting to cluster infrastructure: cman
Aug 12 11:22:21 ti14-demo1 cib[3142]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Aug 12 11:22:21 ti14-demo1 pengine[3146]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Aug 12 11:22:21 ti14-demo1 crmd[3147]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Aug 12 11:22:21 ti14-demo1 crmd[3147]: notice: main: CRM Git Version: 368c726
Aug 12 11:22:21 ti14-demo1 attrd[3145]: notice: main: Starting mainloop...
Aug 12 11:22:21 ti14-demo1 cib[3142]: notice: crm_cluster_connect: Connecting to cluster infrastructure: cman
Aug 12 11:22:22 ti14-demo1 crmd[3147]: notice: crm_cluster_connect: Connecting to cluster infrastructure: cman
Aug 12 11:22:22 ti14-demo1 stonith-ng[3143]: notice: setup_cib: Watching for stonith topology changes
Aug 12 11:22:22 ti14-demo1 stonith-ng[3143]: notice: unpack_config: On loss of CCM Quorum: Ignore
Aug 12 11:22:22 ti14-demo1 crmd[3147]: notice: cman_event_callback: Membership 28: quorum acquired
Aug 12 11:22:22 ti14-demo1 crmd[3147]: notice: crm_update_peer_state: cman_event_callback: Node ti14-demo1[1] - state is now member (was (null))
Aug 12 11:22:22 ti14-demo1 crmd[3147]: notice: crm_update_peer_state: cman_event_callback: Node ti14-demo2[2] - state is now member (was (null))
Aug 12 11:22:22 ti14-demo1 crmd[3147]: notice: do_started: The local CRM is operational
Aug 12 11:22:22 ti14-demo1 crmd[3147]: notice: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
Aug 12 11:22:43 ti14-demo1 crmd[3147]: warning: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
Aug 12 11:22:43 ti14-demo1 crmd[3147]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
Aug 12 11:22:43 ti14-demo1 attrd[3145]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Aug 12 11:22:43 ti14-demo1 pengine[3146]: notice: unpack_config: On loss of CCM Quorum: Ignore
Aug 12 11:22:43 ti14-demo1 pengine[3146]: notice: LogActions: Start cmon#011(ti14-demo1)
Aug 12 11:22:43 ti14-demo1 pengine[3146]: notice: process_pe_message: Calculated Transition 0: /var/lib/pacemaker/pengine/pe-input-10.bz2
Aug 12 11:22:43 ti14-demo1 crmd[3147]: notice: te_rsc_command: Initiating action 6: monitor cmon_monitor_0 on ti14-demo2
Aug 12 11:22:43 ti14-demo1 crmd[3147]: notice: te_rsc_command: Initiating action 4: monitor cmon_monitor_0 on ti14-demo1 (local)
Aug 12 11:22:43 ti14-demo1 crmd[3147]: notice: process_lrm_event: LRM operation cmon_monitor_0 (call=5, rc=7, cib-update=26, confirmed=true) not running
Aug 12 11:22:43 ti14-demo1 crmd[3147]: notice: te_rsc_command: Initiating action 3: probe_complete probe_complete on ti14-demo1 (local) - no waiting
Aug 12 11:22:43 ti14-demo1 attrd[3145]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Aug 12 11:22:43 ti14-demo1 attrd[3145]: notice: attrd_perform_update: Sent update 4: probe_complete=true
Aug 12 11:22:43 ti14-demo1 crmd[3147]: notice: te_rsc_command: Initiating action 5: probe_complete probe_complete on ti14-demo2 - no waiting
Aug 12 11:22:43 ti14-demo1 crmd[3147]: notice: te_rsc_command: Initiating action 7: start cmon_start_0 on ti14-demo1 (local)
Aug 12 11:22:43 ti14-demo1 crmd[3147]: notice: process_lrm_event: LRM operation cmon_start_0 (call=8, rc=0, cib-update=27, confirmed=true) ok
Aug 12 11:22:43 ti14-demo1 crmd[3147]: notice: te_rsc_command: Initiating action 8: monitor cmon_monitor_60000 on ti14-demo1 (local)
Aug 12 11:22:43 ti14-demo1 crmd[3147]: notice: process_lrm_event: LRM operation cmon_monitor_60000 (call=11, rc=0, cib-update=28, confirmed=false) ok
Aug 12 11:22:43 ti14-demo1 crmd[3147]: notice: run_graph: Transition 0 (Complete=7, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-10.bz2): Complete
Aug 12 11:22:43 ti14-demo1 crmd[3147]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]

Killing a cluster-mon on ti14-demo1, resulting in a monitor fail (as expected):

Aug 12 11:24:44 ti14-demo1 lrmd[3144]: notice: operation_finished: cmon_monitor_60000:3206:stderr [ /usr/lib/ocf/resource.d/heartbeat/ClusterMon: line 155: kill: (3175) - No such process ]
Aug 12 11:24:44 ti14-demo1 crmd[3147]: notice: process_lrm_event: LRM operation cmon_monitor_60000 (call=11, rc=7, cib-update=29, confirmed=false) not running
Aug 12 11:24:44 ti14-demo1 crmd[3147]: warning: update_failcount: Updating failcount for cmon on ti14-demo1 after failed monitor: rc=7 (update=value++, time=1407857084)
Aug 12 11:24:44 ti14-demo1 crmd[3147]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Aug 12 11:24:44 ti14-demo1 attrd[3145]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-cmon (1)
Aug 12 11:24:44 ti14-demo1 attrd[3145]: notice: attrd_perform_update: Sent update 9: fail-count-cmon=1
Aug 12 11:24:44 ti14-demo1 attrd[3145]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-cmon (1407857084)
Aug 12 11:24:44 ti14-demo1 pengine[3146]: notice: unpack_config: On loss of CCM Quorum: Ignore
Aug 12 11:24:44 ti14-demo1 attrd[3145]: notice: attrd_perform_update: Sent update 12: last-failure-cmon=1407857084
Aug 12 11:24:44 ti14-demo1 pengine[3146]: warning: unpack_rsc_op: Processing failed op monitor for cmon on ti14-demo1: not running (7)
Aug 12 11:24:44 ti14-demo1 pengine[3146]: notice: LogActions: Recover cmon#011(Started ti14-demo1)
Aug 12 11:24:44 ti14-demo1 pengine[3146]: notice: process_pe_message: Calculated Transition 1: /var/lib/pacemaker/pengine/pe-input-11.bz2
Aug 12 11:24:44 ti14-demo1 pengine[3146]: notice: unpack_config: On loss of CCM Quorum: Ignore
Aug 12 11:24:44 ti14-demo1 pengine[3146]: warning: unpack_rsc_op: Processing failed op monitor for cmon on ti14-demo1: not running (7)
Aug 12 11:24:44 ti14-demo1 pengine[3146]: notice: LogActions: Recover cmon#011(Started ti14-demo1)
Aug 12 11:24:44 ti14-demo1 pengine[3146]: notice: process_pe_message: Calculated Transition 2: /var/lib/pacemaker/pengine/pe-input-12.bz2
Aug 12 11:24:44 ti14-demo1 pengine[3146]: notice: unpack_config: On loss of CCM Quorum: Ignore
Aug 12 11:24:44 ti14-demo1 pengine[3146]: warning: unpack_rsc_op: Processing failed op monitor for cmon on ti14-demo1: not running (7)
Aug 12 11:24:44 ti14-demo1 pengine[3146]: notice: LogActions: Recover cmon#011(Started ti14-demo1)
Aug 12 11:24:44 ti14-demo1 crmd[3147]: notice: te_rsc_command: Initiating action 2: stop cmon_stop_0 on ti14-demo1 (local)
Aug 12 11:24:44 ti14-demo1 pengine[3146]: notice: process_pe_message: Calculated Transition 3: /var/lib/pacemaker/pengine/pe-input-13.bz2
Aug 12 11:24:44 ti14-demo1 lrmd[3144]: notice: operation_finished: cmon_stop_0:3211:stderr [ /usr/lib/ocf/resource.d/heartbeat/ClusterMon: line 145: kill: (3175) - No such process ]
Aug 12 11:24:44 ti14-demo1 crmd[3147]: notice: process_lrm_event: LRM operation cmon_stop_0 (call=16, rc=0, cib-update=33, confirmed=true) ok
Aug 12 11:24:44 ti14-demo1 crmd[3147]: notice: te_rsc_command: Initiating action 7: start cmon_start_0 on ti14-demo1 (local)
Aug 12 11:24:44 ti14-demo1 crmd[3147]: notice: process_lrm_event: LRM operation cmon_start_0 (call=20, rc=0, cib-update=34, confirmed=true) ok
Aug 12 11:24:44 ti14-demo1 crmd[3147]: notice: te_rsc_command: Initiating action 1: monitor cmon_monitor_60000 on ti14-demo1 (local)
Aug 12 11:24:44 ti14-demo1 crmd[3147]: notice: process_lrm_event: LRM operation cmon_monitor_60000 (call=23, rc=0, cib-update=35, confirmed=false) ok
Aug 12 11:24:44 ti14-demo1 crmd[3147]: notice: run_graph: Transition 3 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-13.bz2): Complete
Aug 12 11:24:44 ti14-demo1 crmd[3147]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]

Rebooting ti14-demo2:

Aug 12 11:25:25 ti14-demo1 corosync[2925]: [TOTEM ] A processor failed, forming new configuration.
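As an aside, the "kill: (3175) - No such process" stderr lines above come from ClusterMon signalling a crm_mon PID that is already gone. A stop/monitor path that checks liveness before killing, and always clears the pidfile, avoids that noise and stays idempotent. A sketch, with illustrative names (the real agent's variables and structure differ):

```shell
#!/bin/sh
# Sketch of an idempotent stop for a pidfile-managed daemon, in the
# spirit of what ClusterMon does with crm_mon. Names are illustrative.
stop_daemon() {
    pidfile=$1
    [ -f "$pidfile" ] || return 0           # no pidfile: nothing to stop
    pid=$(cat "$pidfile")
    if kill -0 "$pid" 2>/dev/null; then     # signal 0: liveness check only
        kill "$pid"
    fi
    rm -f "$pidfile"                        # always clear stale state
    return 0                                # OCF stop must succeed if already stopped
}
```

An OCF stop must return success when the process is already gone, which is why the function removes the stale pidfile and returns 0 rather than treating a missing process as an error.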
Aug 12 11:25:27 ti14-demo1 corosync[2925]: [CMAN ] quorum lost, blocking activity
Aug 12 11:25:27 ti14-demo1 corosync[2925]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Aug 12 11:25:27 ti14-demo1 corosync[2925]: [QUORUM] Members[1]: 1
Aug 12 11:25:27 ti14-demo1 corosync[2925]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 12 11:25:27 ti14-demo1 crmd[3147]: notice: cman_event_callback: Membership 32: quorum lost
Aug 12 11:25:27 ti14-demo1 kernel: dlm: closing connection to node 2
Aug 12 11:25:27 ti14-demo1 corosync[2925]: [CPG ] chosen downlist: sender r(0) ip(10.28.230.201) ; members(old:2 left:1)
Aug 12 11:25:27 ti14-demo1 corosync[2925]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 12 11:25:27 ti14-demo1 crmd[3147]: notice: crm_update_peer_state: cman_event_callback: Node ti14-demo2[2] - state is now lost (was member)
Aug 12 11:25:27 ti14-demo1 crmd[3147]: warning: match_down_event: No match for shutdown action on ti14-demo2
Aug 12 11:25:27 ti14-demo1 crmd[3147]: notice: peer_update_callback: Stonith/shutdown of ti14-demo2 not matched
Aug 12 11:25:27 ti14-demo1 crmd[3147]: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ]
Aug 12 11:25:27 ti14-demo1 attrd[3145]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Aug 12 11:25:27 ti14-demo1 attrd[3145]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-cmon (1407857084)
Aug 12 11:25:27 ti14-demo1 crmd[3147]: warning: match_down_event: No match for shutdown action on ti14-demo2
Aug 12 11:25:27 ti14-demo1 crmd[3147]: notice: peer_update_callback: Stonith/shutdown of ti14-demo2 not matched
Aug 12 11:25:27 ti14-demo1 attrd[3145]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-cmon (1)
Aug 12 11:25:27 ti14-demo1 pengine[3146]: notice: unpack_config: On loss of CCM Quorum: Ignore
Aug 12 11:25:27 ti14-demo1 pengine[3146]: warning: unpack_rsc_op: Processing failed op monitor for cmon on ti14-demo1: not running (7)
Aug 12 11:25:27 ti14-demo1 attrd[3145]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Aug 12 11:25:27 ti14-demo1 pengine[3146]: notice: process_pe_message: Calculated Transition 4: /var/lib/pacemaker/pengine/pe-input-14.bz2
Aug 12 11:25:27 ti14-demo1 crmd[3147]: notice: run_graph: Transition 4 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-14.bz2): Complete
Aug 12 11:25:27 ti14-demo1 crmd[3147]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Aug 12 11:28:03 ti14-demo1 corosync[2925]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 12 11:28:03 ti14-demo1 corosync[2925]: [CMAN ] quorum regained, resuming activity
Aug 12 11:28:03 ti14-demo1 corosync[2925]: [QUORUM] This node is within the primary component and will provide service.
Aug 12 11:28:03 ti14-demo1 corosync[2925]: [QUORUM] Members[2]: 1 2
Aug 12 11:28:03 ti14-demo1 corosync[2925]: [QUORUM] Members[2]: 1 2
Aug 12 11:28:03 ti14-demo1 crmd[3147]: notice: cman_event_callback: Membership 36: quorum acquired
Aug 12 11:28:03 ti14-demo1 crmd[3147]: notice: crm_update_peer_state: cman_event_callback: Node ti14-demo2[2] - state is now member (was lost)
Aug 12 11:28:03 ti14-demo1 corosync[2925]: [CPG ] chosen downlist: sender r(0) ip(10.28.230.201) ; members(old:1 left:0)
Aug 12 11:28:03 ti14-demo1 corosync[2925]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 12 11:28:11 ti14-demo1 crmd[3147]: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=peer_update_callback ]
Aug 12 11:28:13 ti14-demo1 attrd[3145]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Aug 12 11:28:13 ti14-demo1 attrd[3145]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-cmon (1407857084)
Aug 12 11:28:13 ti14-demo1 attrd[3145]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-cmon (1)
Aug 12 11:28:13 ti14-demo1 attrd[3145]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Aug 12 11:28:14 ti14-demo1 pengine[3146]: notice: unpack_config: On loss of CCM Quorum: Ignore
Aug 12 11:28:14 ti14-demo1 pengine[3146]: warning: unpack_rsc_op: Processing failed op monitor for cmon on ti14-demo1: not running (7)

Here the monitor "fails" for cluster_mon on ti14-demo1 -- but it never actually got called, and never failed in any way. cluster_mon has been running fine since being restarted at the very beginning.
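Note that the attributes being flushed at 11:28:13 (fail-count-cmon=1, last-failure-cmon=1407857084) are exactly the values recorded for the earlier, genuine failure at 11:24:44, which suggests the pengine warning here is a re-evaluation of that stored history after the rejoin rather than a new agent call. On that assumption, clearing the stored failure record after a genuine recovery would avoid the replay; a hedged sketch using the standard CLI for this Pacemaker 1.1 stack (exact option spellings can vary across pacemaker/pcs releases, and these obviously need a live cluster):

```shell
# Clear the stored operation history and fail-count for the cmon
# resource on ti14-demo1, so old failures are not re-evaluated later.
crm_resource --resource cmon --cleanup --node ti14-demo1

# or, with pcs:
pcs resource cleanup cmon
```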
Aug 12 11:28:14 ti14-demo1 pengine[3146]: notice: process_pe_message: Calculated Transition 5: /var/lib/pacemaker/pengine/pe-input-15.bz2
Aug 12 11:28:14 ti14-demo1 crmd[3147]: notice: te_rsc_command: Initiating action 6: monitor cmon_monitor_0 on ti14-demo2
Aug 12 11:28:14 ti14-demo1 crmd[3147]: notice: te_rsc_command: Initiating action 5: probe_complete probe_complete on ti14-demo2 - no waiting
Aug 12 11:28:14 ti14-demo1 crmd[3147]: notice: run_graph: Transition 5 (Complete=3, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-15.bz2): Complete
Aug 12 11:28:14 ti14-demo1 crmd[3147]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]

On Monday, August 11, 2014 4:06 PM, David Vossel <dvos...@redhat.com> wrote:

----- Original Message -----
> Greetings,
>
> We are using pacemaker and cman in a two-node cluster with no-quorum-policy:
> ignore and stonith-enabled: false on a CentOS 6 system (pacemaker-related
> RPM versions are listed below). We are seeing some bizarre (to us) behavior
> when a node is fully lost (e.g. reboot -nf). Here's the scenario we have:
>
> 1) Fail a resource named "some-resource" started with the
> ocf:heartbeat:anything script (or others) on node01 (in our case, it's a
> master/slave resource we're pulling observations from, but it can happen on
> normal ones).
> 2) Wait for the resource to recover.
> 3) Fail node02 (reboot -nf, or power loss).
> 4) When node02 recovers, we see in /var/log/messages:
> - Quorum is recovered
> - Sending flush op to all hosts for master-some-resource,
> last-failure-some-resource, probe_complete (true),
> fail-count-some-resource (1)
> - pengine: Processing failed op monitor for some-resource on node01: unknown
> error (1)
> * After adding a simple "`date` called with $@ >> /tmp/log.rsc", we do not
> see the resource agent being called at this time, on either node.
> * Sometimes, we see other operations happen that are also not being sent to
> the RA, including stop/start.
> * The resource is actually happily running on node01 throughout this whole
> process, so there's no reason we should be seeing this failure here.
> * This issue is only seen on resources that had not yet been cleaned up.
> Resources that were 'clean' when both nodes were last online do not have
> this issue.
>
> We noticed this originally because we are using the ClusterMon RA to report
> on different types of errors, and this is giving us false positives. Any
> thoughts on configuration issues we could be having, or if this sounds like
> a bug in pacemaker somewhere?

This is likely a bug in whatever resource-agent you are using. There's no
way for us to know for sure without logs.

-- Vossel

> Thanks!
>
> ----
> Versions:
> ccs-0.16.2-69.el6_5.1.x86_64
> clusterlib-3.0.12.1-59.el6_5.2.x86_64
> cman-3.0.12.1-59.el6_5.2.x86_64
> corosync-1.4.1-17.el6_5.1.x86_64
> corosynclib-1.4.1-17.el6_5.1.x86_64
> fence-virt-0.2.3-15.el6.x86_64
> libqb-0.16.0-2.el6.x86_64
> modcluster-0.16.2-28.el6.x86_64
> openais-1.1.1-7.el6.x86_64
> openaislib-1.1.1-7.el6.x86_64
> pacemaker-1.1.10-14.el6_5.3.x86_64
> pacemaker-cli-1.1.10-14.el6_5.3.x86_64
> pacemaker-cluster-libs-1.1.10-14.el6_5.3.x86_64
> pacemaker-libs-1.1.10-14.el6_5.3.x86_64
> pcs-0.9.90-2.el6.centos.3.noarch
> resource-agents-3.9.2-40.el6_5.7.x86_64
> ricci-0.16.2-69.el6_5.1.x86_64
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org