Dear all, I configured meta failure-timeout=60sec on all of my resources. For the sake of simplicity, assume I have a group of two resources FIRST and SECOND (where SECOND is started after FIRST, surprise!).
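To make the timing in the scenario described below concrete, here is a minimal sketch in Python. It is not Pacemaker code and says nothing about how Pacemaker evaluates failure-timeout internally. It only compares the two durations involved. The function name is hypothetical. The 62.827 s value is the exec-time of the SECOND stop operation in the log excerpt further down. The sketch also assumes, for simplicity, that the stop of SECOND begins at the moment the failure of FIRST is recorded.

```python
def failure_expires_during_stop(failure_timeout_s: float, stop_duration_s: float) -> bool:
    """Return True if the failure record of FIRST would expire before
    SECOND has finished stopping.

    Simplifying assumption (not taken from Pacemaker itself): the stop of
    SECOND starts at the same instant the failure of FIRST is recorded.
    """
    return stop_duration_s > failure_timeout_s


# Values from this post: failure-timeout=60sec, stop exec-time:62827ms.
FAILURE_TIMEOUT_S = 60.0
SECOND_STOP_S = 62.827

print(failure_expires_during_stop(FAILURE_TIMEOUT_S, SECOND_STOP_S))
```

With the values from this post the stop outlasts the timeout by roughly 2.8 seconds, which is the window in which the failure of FIRST expires while SECOND is still stopping.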
If FIRST now crashes, I see a failure, as expected. I also see that SECOND is stopped, as expected.

Unfortunately, SECOND needs more than 60 seconds to stop. It can therefore happen that the "failure-timeout" for FIRST is reached and its failure is cleared. This is also expected.

The problem is that, after the 60-second timeout, Pacemaker assumes that FIRST is in the Started state. Nothing in the log files explains why, and the last monitor operation, which ran only a few seconds earlier, reported that FIRST is not running.

As a consequence of this bug, Pacemaker tries to restart SECOND on the same system, and that start fails (SECOND depends on FIRST, which is not actually running). Only then are the resources started on the other system.

So my questions are: Why does Pacemaker assume that a previously failed resource is "Started" when the "meta failure-timeout" is triggered? Why is the monitor operation not invoked to determine the actual state?

The corresponding lines of the log file, about a minute after FIRST crashed and the stop operation for SECOND was triggered:

Oct 16 16:27:20 [2100] HOSTNAME [...] (monitor operation indicating that FIRST is not running) [...]
Oct 16 16:27:23 [2104] HOSTNAME lrmd: info: log_finished: finished - rsc:SECOND action:stop call_id:123 pid:29314 exit-code:0 exec-time:62827ms queue-time:0ms
Oct 16 16:27:23 [2107] HOSTNAME crmd: notice: process_lrm_event: LRM operation SECOND_stop_0 (call=123, rc=0, cib-update=225, confirmed=true) ok
Oct 16 16:27:23 [2107] HOSTNAME crmd: info: match_graph_event: Action SECOND_stop_0 (74) confirmed on HOSTNAME (rc=0)
Oct 16 16:27:23 [2107] HOSTNAME crmd: notice: run_graph: Transition 40 (Complete=5, Pending=0, Fired=0, Skipped=31, Incomplete=10, Source=/var/lib/pacemaker/pengine/pe-input-2937.bz2): Stopped
Oct 16 16:27:23 [2107] HOSTNAME crmd: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct 16 16:27:23 [2100] HOSTNAME cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/crmd/225, version=0.1450.89)
Oct 16 16:27:23 [2100] HOSTNAME cib: info: cib_process_request: Completed cib_query operation for section 'all': OK (rc=0, origin=local/crmd/226, version=0.1450.89)
Oct 16 16:27:23 [2106] HOSTNAME pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 16 16:27:23 [2106] HOSTNAME pengine: info: determine_online_status_fencing: Node HOSTNAME is active
Oct 16 16:27:23 [2106] HOSTNAME pengine: info: determine_online_status: Node HOSTNAME is online
[...]
Oct 16 16:27:23 [2106] HOSTNAME pengine: info: get_failcount_full: FIRST has failed 1 times on HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME pengine: notice: unpack_rsc_op: Clearing expired failcount for FIRST on HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME pengine: info: get_failcount_full: FIRST has failed 1 times on HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME pengine: notice: unpack_rsc_op: Clearing expired failcount for FIRST on HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME pengine: info: get_failcount_full: FIRST has failed 1 times on HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME pengine: notice: unpack_rsc_op: Clearing expired failcount for FIRST on HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME pengine: notice: unpack_rsc_op: Re-initiated expired calculated failure FIRST_last_failure_0 (rc=7, magic=0:7;68:31:0:28c68203-6990-48fd-96cc-09f86e2b21f9) on HOSTNAME
[...]
Oct 16 16:27:23 [2106] HOSTNAME pengine: info: group_print: Resource Group: GROUP
Oct 16 16:27:23 [2106] HOSTNAME pengine: info: native_print: FIRST (ocf::heartbeat:xxx): Started HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME pengine: info: native_print: SECOND (ocf::heartbeat:yyy): Stopped

Thank you,
Carsten

--
andrena objects ag
Frankfurt office
Clemensstr. 8
60487 Frankfurt
Tel: +49 (0) 69 977 860 38
Fax: +49 (0) 69 977 860 39
http://www.andrena.de

Executive board: Hagen Buchwald, Matthias Grund, Dr. Dieter Kuhn
Chairman of the supervisory board: Rolf Hetzelberger
Registered office: Karlsruhe
Commercial register: Amtsgericht Mannheim, HRB 109694
VAT ID: DE174314824

Please also note our upcoming events: http://www.andrena.de/events
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org