[Pacemaker] meta failure-timeout: crashed resource is assumed to be Started?

Carsten Otto Thu, 16 Oct 2014 08:12:31 -0700

Dear all,

I configured meta failure-timeout=60sec on all of my resources. For the
sake of simplicity, assume I have a group of two resources FIRST and
SECOND (where SECOND is started after FIRST, surprise!).


If FIRST now crashes, I see a failure, as expected. I also see that
SECOND is stopped, as expected.

Unfortunately, SECOND needs more than 60 seconds to stop. Thus, it can
happen that the "failure-timeout" for FIRST is reached and its failure
is cleared. This is also expected.
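
To make the timing explicit, here is a minimal sketch of the race in
Python. It is a simplified model, not pacemaker's actual logic: it
assumes the failure of FIRST is recorded at t=0, that the stop of SECOND
begins at roughly the same moment, and that the failure is treated as
expired at the first policy engine run after the stop has finished. The
class and method names are made up for illustration; the 62.827 seconds
come from the exec-time in the log excerpt of this mail.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Hypothetical model of the timing described in this mail."""
    failure_timeout_s: float  # meta failure-timeout configured on FIRST
    stop_duration_s: float    # how long SECOND needs to stop

    def failure_expires_during_stop(self) -> bool:
        # Assumption: the failure is recorded and the stop begins at t=0,
        # so the failure has expired once the stop takes longer than the
        # timeout.
        return self.stop_duration_s > self.failure_timeout_s


scenario = Scenario(failure_timeout_s=60.0, stop_duration_s=62.827)
print(scenario.failure_expires_during_stop())  # True
```

With a stop that takes less than 60 seconds, the same check returns
False, and the failure would still be recorded when the next transition
is calculated.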

The problem is that, once the 60-second timeout has expired, pacemaker
assumes that FIRST is in the Started state. Nothing in the log files
supports that assumption, and the last monitor operation, which ran just
a few seconds earlier, also reported that FIRST is not running.

As a consequence of the bug, pacemaker tries to restart SECOND on the
same system, where it fails to start (as it depends on FIRST, which
is actually not running). Only then are the resources started on the
other system.

So, my question is:
Why does pacemaker assume that a previously failed resource is "Started"
when the "meta failure-timeout" is triggered? Why is the monitor
operation not invoked to determine the correct state?

The corresponding lines of the log file, about a minute after FIRST
crashed and the stop operation for SECOND was triggered:

Oct 16 16:27:20 [2100] HOSTNAME [...] (monitor operation indicating that FIRST is not running)
[...]
Oct 16 16:27:23 [2104] HOSTNAME       lrmd:     info: log_finished: finished - rsc:SECOND action:stop call_id:123 pid:29314 exit-code:0 exec-time:62827ms queue-time:0ms
Oct 16 16:27:23 [2107] HOSTNAME       crmd:   notice: process_lrm_event: LRM operation SECOND_stop_0 (call=123, rc=0, cib-update=225, confirmed=true) ok
Oct 16 16:27:23 [2107] HOSTNAME       crmd:     info: match_graph_event: Action SECOND_stop_0 (74) confirmed on HOSTNAME (rc=0)
Oct 16 16:27:23 [2107] HOSTNAME       crmd:   notice: run_graph: Transition 40 (Complete=5, Pending=0, Fired=0, Skipped=31, Incomplete=10, Source=/var/lib/pacemaker/pengine/pe-input-2937.bz2): Stopped
Oct 16 16:27:23 [2107] HOSTNAME       crmd:     info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct 16 16:27:23 [2100] HOSTNAME        cib:     info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/crmd/225, version=0.1450.89)
Oct 16 16:27:23 [2100] HOSTNAME        cib:     info: cib_process_request: Completed cib_query operation for section 'all': OK (rc=0, origin=local/crmd/226, version=0.1450.89)
Oct 16 16:27:23 [2106] HOSTNAME    pengine:   notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 16 16:27:23 [2106] HOSTNAME    pengine:     info: determine_online_status_fencing: Node HOSTNAME is active
Oct 16 16:27:23 [2106] HOSTNAME    pengine:     info: determine_online_status: Node HOSTNAME is online
[...]
Oct 16 16:27:23 [2106] HOSTNAME    pengine:     info: get_failcount_full: FIRST has failed 1 times on HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME    pengine:   notice: unpack_rsc_op: Clearing expired failcount for FIRST on HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME    pengine:     info: get_failcount_full: FIRST has failed 1 times on HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME    pengine:   notice: unpack_rsc_op: Clearing expired failcount for FIRST on HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME    pengine:     info: get_failcount_full: FIRST has failed 1 times on HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME    pengine:   notice: unpack_rsc_op: Clearing expired failcount for FIRST on HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME    pengine:   notice: unpack_rsc_op: Re-initiated expired calculated failure FIRST_last_failure_0 (rc=7, magic=0:7;68:31:0:28c68203-6990-48fd-96cc-09f86e2b21f9) on HOSTNAME
[...]
Oct 16 16:27:23 [2106] HOSTNAME    pengine:     info: group_print: Resource Group: GROUP
Oct 16 16:27:23 [2106] HOSTNAME    pengine:     info: native_print:  FIRST   (ocf::heartbeat:xxx):      Started HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME    pengine:     info: native_print:  SECOND     (ocf::heartbeat:yyy):        Stopped
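
For anyone who wants to check their own logs for the same situation,
the following is a rough Python sketch that looks for operations whose
exec-time exceeds the configured failure-timeout. The regular expression
is derived only from the lrmd "log_finished" line shown above, so it is
an assumption that other pacemaker versions use the same layout. The
function name slow_operations is made up for this example.

```python
import re

# Pattern inferred from the single lrmd "log_finished" line in this mail;
# other log formats are not covered.
FINISHED_RE = re.compile(
    r"rsc:(?P<rsc>\S+) action:(?P<action>\S+) .*?exec-time:(?P<ms>\d+)ms"
)


def slow_operations(log_text, failure_timeout_s):
    """Return (resource, action, seconds) for operations slower than the timeout."""
    # Collapse all whitespace so that entries split across several lines
    # still match the pattern.
    flat = " ".join(log_text.split())
    hits = []
    for m in FINISHED_RE.finditer(flat):
        seconds = int(m.group("ms")) / 1000.0
        if seconds > failure_timeout_s:
            hits.append((m.group("rsc"), m.group("action"), seconds))
    return hits


sample = """Oct 16 16:27:23 [2104] HOSTNAME       lrmd:     info: log_finished:
finished - rsc:SECOND action:stop call_id:123 pid:29314 exit-code:0
exec-time:62827ms queue-time:0ms"""

print(slow_operations(sample, 60))  # [('SECOND', 'stop', 62.827)]
```

In this excerpt the only hit is the stop of SECOND, which took about
2.8 seconds longer than the 60-second failure-timeout of FIRST.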

Thank you,
Carsten
-- 
andrena objects ag
Frankfurt office
Clemensstr. 8
60487 Frankfurt

Tel: +49 (0) 69 977 860 38
Fax: +49 (0) 69 977 860 39
http://www.andrena.de

Executive Board: Hagen Buchwald, Matthias Grund, Dr. Dieter Kuhn
Chairman of the Supervisory Board: Rolf Hetzelberger

Registered office: Karlsruhe
District Court (Amtsgericht) Mannheim, HRB 109694
VAT ID: DE174314824

Please also note our upcoming events:
http://www.andrena.de/events
