On 24/05/2013, at 13:38, Andrew Beekhof wrote:

On 24/05/2013, at 2:19 PM, Andrew Beekhof <and...@beekhof.net> wrote:


On 23/05/2013, at 4:44 PM, Kazunori INOUE <inouek...@intellilink.co.jp> wrote:

Hi,

I'm using pacemaker-1.1 (c3486a4a8d. the latest devel).
After fencing caused by a split-brain failed 11 times, the cluster stays in the
S_POLICY_ENGINE state even after I recover from the split-brain.

Odd, I get:

May 24 00:17:08 corosync-host-1 crmd[3056]:   notice: tengine_stonith_callback: 
Stonith operation 12/69:23:0:9b069b96-3565-4219-85a5-8782bdb5d9d3: No route to 
host (-113)
May 24 00:17:08 corosync-host-1 crmd[3056]:   notice: tengine_stonith_callback: 
Stonith operation 12 for corosync-host-6 failed (No route to host): aborting 
transition.
May 24 00:17:08 corosync-host-1 crmd[3056]:   notice: run_graph: Transition 23 
(Complete=1, Pending=0, Fired=0, Skipped=2, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-warn-110.bz2): Stopped
May 24 00:17:08 corosync-host-1 crmd[3056]:   notice: too_many_st_failures: Too 
many failures to fence corosync-host-6 (11), giving up
May 24 00:17:08 corosync-host-1 crmd[3056]:   notice: do_state_transition: State 
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd ]
May 24 00:17:08 corosync-host-1 crmd[3056]:   notice: tengine_stonith_notify: 
Peer corosync-host-6 was not terminated (reboot) by corosync-host-1 for 
corosync-host-1: No route to host (ref=9dd3711e-c87d-4b2e-acd1-854391a6fa9d) by 
client crmd.3056

Same for you:

May 23 13:17:28 [24868] dev1       crmd:   notice: too_many_st_failures:        
Too many failures to fence dev2 (11), giving up
May 23 13:17:28 [24868] dev1       crmd:    debug: notify_crmd:         
Transition 10 status: restart - Stonith failed
May 23 13:17:28 [24868] dev1       crmd:    debug: s_crmd_fsa:  Processing 
I_TE_SUCCESS: [ state=S_TRANSITION_ENGINE cause=C_FSA_INTERNAL 
origin=notify_crmd ]
May 23 13:17:28 [24868] dev1       crmd:     info: do_log:      FSA: Input 
I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE

and

May 23 13:17:28 [7107] dev2       crmd:   notice: too_many_st_failures:         
Too many failures to fence dev1 (11), giving up
May 23 13:17:28 [7107] dev2       crmd:    debug: notify_crmd:  Transition 13 
status: restart - Stonith failed
May 23 13:17:28 [7107] dev2       crmd:    debug: s_crmd_fsa:   Processing 
I_TE_SUCCESS: [ state=S_TRANSITION_ENGINE cause=C_FSA_INTERNAL 
origin=notify_crmd ]
May 23 13:17:28 [7107] dev2       crmd:     info: do_log:       FSA: Input 
I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
May 23 13:17:28 [7107] dev2       crmd:   notice: do_state_transition:  State 
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd ]

oh, but not here:

May 23 13:24:23 [7107] dev2       crmd:    debug: do_te_invoke:         
Cancelling the transition: inactive
May 23 13:24:23 [7107] dev2       crmd:     info: abort_transition_graph:       
do_te_invoke:155 - Triggered transition abort (complete=1) : Peer Cancelled
May 23 13:24:23 [7107] dev2       crmd:   notice: too_many_st_failures:         
Too many failures to fence dev1 (11), giving up
May 23 13:24:23 [7107] dev2       crmd:    debug: s_crmd_fsa:   Processing 
I_TE_SUCCESS: [ state=S_POLICY_ENGINE cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
May 23 13:24:23 [7107] dev2       crmd:  warning: do_log:       FSA: Input 
I_TE_SUCCESS from abort_transition_graph() received in state S_POLICY_ENGINE
May 23 13:24:23 [7107] dev2       crmd:    debug: te_update_diff:       Processing 
diff (cib_modify): 0.5.24 -> 0.5.25 (S_POLICY_ENGINE)
May 23 13:24:23 [7107] dev2       crmd:    debug: te_update_diff:       Processing 
diff (cib_modify): 0.5.25 -> 0.5.26 (S_POLICY_ENGINE)
May 23 13:24:23 [7107] dev2       crmd:    debug: 
join_update_complete_callback:        Join update 95 complete
May 23 13:24:23 [7107] dev2       crmd:    debug: check_join_state:     Invoked 
by join_update_complete_callback in state: S_POLICY_ENGINE
May 23 13:47:54 [7107] dev2       crmd:   notice: handle_request:       Current 
ping state: S_POLICY_ENGINE

Can you try the following patch?

diff --git a/crmd/te_utils.c b/crmd/te_utils.c
index ae4c5de..f3e0d9f 100644
--- a/crmd/te_utils.c
+++ b/crmd/te_utils.c
@@ -408,15 +408,11 @@ abort_transition_graph(int abort_priority, enum 
transition_action abort_action,
      fsa_pe_ref = NULL;

      if (transition_graph->complete) {
-        if (too_many_st_failures() == FALSE) {
-            if (transition_timer->period_ms > 0) {
-                crm_timer_stop(transition_timer);
-                crm_timer_start(transition_timer);
-            } else {
-                register_fsa_input(C_FSA_INTERNAL, I_PE_CALC, NULL);
-            }
+        if (transition_timer->period_ms > 0) {
+            crm_timer_stop(transition_timer);
+            crm_timer_start(transition_timer);
          } else {
-            register_fsa_input(C_FSA_INTERNAL, I_TE_SUCCESS, NULL);
+            register_fsa_input(C_FSA_INTERNAL, I_PE_CALC, NULL);
          }
          return;
      }
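
For readability, here is the branch the patch leaves behind, reconstructed from the
hunk above (the rest of abort_transition_graph() is unchanged and omitted). The
effect is that a completed graph no longer feeds I_TE_SUCCESS back into the FSA once
fencing has repeatedly failed; instead the retry is deferred to the transition
timer, or a new policy engine run is requested right away:

    if (transition_graph->complete) {
        if (transition_timer->period_ms > 0) {
            /* a timer is configured: restart it and let it schedule the next run */
            crm_timer_stop(transition_timer);
            crm_timer_start(transition_timer);
        } else {
            /* no timer: ask for a new policy engine calculation immediately */
            register_fsa_input(C_FSA_INTERNAL, I_PE_CALC, NULL);
        }
        return;
    }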


Hi Andrew,

I confirmed that the patch fixes the problem.



The expected behavior is that, after too_many_st_failures() returns true, we
retry once per re-check interval until either the node is confirmed down with
stonith_admin -C or fencing succeeds.
If the node comes back, so that fencing is no longer needed, but fencing was never
confirmed to have worked, then the count in too_many_st_failures() is not cleared.

Make sense?


It makes sense.
Thanks!







1. Disconnect the network connection
[dev1 ~]$ crm_mon
Last updated: Thu May 23 13:16:41 2013
Last change: Thu May 23 13:15:30 2013 via cibadmin on dev1
Stack: corosync
Current DC: dev1 (3232261525) - partition WITHOUT quorum
Version: 1.1.10-0.122.c3486a4.git.el6-c3486a4
2 Nodes configured, unknown expected votes
2 Resources configured.


Node dev2 (3232261523): UNCLEAN (offline)
Online: [ dev1 ]

f1      (stonith:external/libvirt.NG):  Started dev2
f2      (stonith:external/libvirt.NG):  Started dev1

[dev2 ~]$ crm_mon
Last updated: Thu May 23 13:16:41 2013
Last change: Thu May 23 13:15:30 2013 via cibadmin on dev1
Stack: corosync
Current DC: dev2 (3232261523) - partition WITHOUT quorum
Version: 1.1.10-0.122.c3486a4.git.el6-c3486a4
2 Nodes configured, unknown expected votes
2 Resources configured.


Node dev1 (3232261525): UNCLEAN (offline)
Online: [ dev2 ]

f1      (stonith:external/libvirt.NG):  Started dev2
f2      (stonith:external/libvirt.NG):  Started dev1


2. Wait until fencing has failed 11 times
[dev1 ~]$ egrep "CRIT|too_many_st_failures" /var/log/ha-log
May 23 13:16:46 dev1 stonith: [24981]: CRIT: external_reset_req: 'libvirt.NG 
reset' for host dev2 failed with rc 1
(snip)
May 23 13:17:24 dev1 stonith: [25105]: CRIT: external_reset_req: 'libvirt.NG 
reset' for host dev2 failed with rc 1
May 23 13:17:28 dev1 stonith: [25118]: CRIT: external_reset_req: 'libvirt.NG 
reset' for host dev2 failed with rc 1
May 23 13:17:28 dev1 crmd[24868]:   notice: too_many_st_failures: Too many 
failures to fence dev2 (11), giving up

[dev2 ~]$ egrep "CRIT|too_many_st_failures" /var/log/ha-log
May 23 13:16:46 dev2 stonith: [7177]: CRIT: external_reset_req: 'libvirt.NG 
reset' for host dev1 failed with rc 1
(snip)
May 23 13:17:23 dev2 stonith: [7295]: CRIT: external_reset_req: 'libvirt.NG 
reset' for host dev1 failed with rc 1
May 23 13:17:28 dev2 stonith: [7309]: CRIT: external_reset_req: 'libvirt.NG 
reset' for host dev1 failed with rc 1
May 23 13:17:28 dev2 crmd[7107]:   notice: too_many_st_failures: Too many 
failures to fence dev1 (11), giving up


3. Restore the network connection
[dev1 ~]$ crm_mon
Last updated: Thu May 23 13:24:23 2013
Last change: Thu May 23 13:15:30 2013 via cibadmin on dev1
Stack: corosync
Current DC: dev2 (3232261523) - partition with quorum
Version: 1.1.10-0.122.c3486a4.git.el6-c3486a4
2 Nodes configured, unknown expected votes
2 Resources configured.


Online: [ dev1 dev2 ]

f1      (stonith:external/libvirt.NG):  Started dev2
f2      (stonith:external/libvirt.NG):  Started dev1


The S_POLICY_ENGINE state persists even though the member's join appears to have
succeeded.

[13:47:54 root@dev1 ~]$ crmadmin -S dev2
Status of crmd@dev2: S_POLICY_ENGINE (ok)


Best Regards,
Kazunori INOUE
<keeping-S_POLICY_ENGINE.tar.bz2>

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


