(13.07.25 11:00), Andrew Beekhof wrote:
On 24/07/2013, at 7:40 PM, Kazunori INOUE <inouek...@intellilink.co.jp> wrote:
(13.07.18 19:23), Andrew Beekhof wrote:
On 17/07/2013, at 6:53 PM, Kazunori INOUE <inouek...@intellilink.co.jp> wrote:
(13.07.16 21:18), Andrew Beekhof wrote:
On 16/07/2013, at 7:04 PM, Kazunori INOUE <inouek...@intellilink.co.jp> wrote:
(13.07.15 11:00), Andrew Beekhof wrote:
On 12/07/2013, at 6:28 PM, Kazunori INOUE <inouek...@intellilink.co.jp> wrote:
Hi,
I'm using pacemaker-1.1.10.
When one of Pacemaker's processes crashes, the node is sometimes fenced and
sometimes not.
Is this the expected behavior?
Yes.
Sometimes the dev1 respawns the processes fast enough that dev2 gets the "hey, i'm
back" notification before the PE gets run and fencing can be initiated.
In such cases, there is nothing to be gained from fencing - dev1 is reachable
and responding.
OK... but I want Pacemaker to behave deterministically (always fence, or never
fence), because the current unpredictability makes operation troublesome.
I think it would be better if the user could choose the behavior via an option.
This makes no sense. Sorry.
It is wrong to induce more downtime than absolutely necessary just to make a
test pass.
If we are concerned about increased downtime, isn't it better to prevent fencing
in this case?
With hindsight, yes.
But we have no way of knowing at the time.
If you want pacemaker to wait some time for it to come back, you can set
crmd-transition-delay which will achieve the same thing it does for attrd.
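For reference, crmd-transition-delay is a cluster-wide property, so it can be set with crm_attribute; a minimal sketch (the 2s value is purely illustrative, not a recommendation):

```shell
# Delay the start of each transition so that a late-arriving
# "the process is back" notification can cancel unnecessary fencing.
# The 2s value here is an illustration; tune it to your environment.
crm_attribute --type crm_config --name crmd-transition-delay --update 2s
```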
I think crmd-transition-delay only partly meets my needs, because it is, after
all, a delay.
The only alternative to a delay, either by crmd-transition-delay or some other
means, is that the crmd predicts the future.
Since pacemakerd respawns a crashed child process, the cluster will return to an
online state.
In that case, doesn't subsequent fencing only increase downtime?
Yes, but only we know that because we have more knowledge than the cluster.
Is it because the stack is corosync?
No.
In pacemaker-1.0 with heartbeat, the behavior when a child process crashes can
be specified in ha.cf.
- when 'pacemaker respawn' is specified, the cluster recovers to an online state.
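For context, the relevant ha.cf fragment would look roughly like this (a sketch of the heartbeat configuration directive being discussed, not verified against a running cluster):

```
# /etc/ha.d/ha.cf
# Restart crashed Pacemaker child processes in place:
pacemaker respawn
# Alternatively, reboot the node when a child process dies:
# pacemaker on
```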
The node may still end up being fenced even with "pacemaker respawn".
If the node does not recover fast enough, relative to the "some process died"
notification, then the node will get fenced.
If the "hey the process is back again" notification gets held up due to network
congestion, then the node will get fenced.
Like most things in clustering, timing is hugely significant - consider a
resource that fails just before vs. just after a monitor action is run
Now it could be that heartbeat is consistently slow sending out the "some process
died" notification (I recall it does not send them at all sometimes), but that would
be a bug not a feature.
Sorry, I mistook it.
You're right.
- when 'pacemaker on' is specified, the node reboots by oneself.
"by oneself"? Not because the other side fences it?
Yes, "by oneself".
[14:34:25 root@vm3 ~]$ gdb /usr/lib64/heartbeat/heartbeat 9876
:
[14:35:33 root@vm3 ~]$ pkill -9 crmd
:
(gdb) b cl_reboot
Breakpoint 2 at 0x7f0e433bdcf8
(gdb) c
Continuing.
Breakpoint 2, 0x00007f0e433bdcf8 in cl_reboot () from /usr/lib64/libplumb.so.2
(gdb) bt
#0  0x00007f0e433bdcf8 in cl_reboot () from /usr/lib64/libplumb.so.2
#1  0x000000000040d8e4 in ManagedChildDied (p=0x117f6e0, status=<value optimized out>, signo=9, exitcode=0, waslogged=1) at heartbeat.c:3906
#2  0x00007f0e433c8fcf in ReportProcHasDied () from /usr/lib64/libplumb.so.2
#3  0x00007f0e433c140c in ?? () from /usr/lib64/libplumb.so.2
#4  0x00007f0e433c0fe0 in ?? () from /usr/lib64/libplumb.so.2
#5  0x0000003240c38f0e in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#6  0x0000003240c3c938 in ?? () from /lib64/libglib-2.0.so.0
#7  0x0000003240c3cd55 in g_main_loop_run () from /lib64/libglib-2.0.so.0
#8  0x000000000040e8b8 in master_control_process () at heartbeat.c:1650
#9  initialize_heartbeat () at heartbeat.c:1041
#10 0x000000000040f38d in main (argc=<value optimized out>, argv=<value optimized out>, envp=0x7fffe0ba9bd8) at heartbeat.c:5133
(gdb) n
Message from syslogd@vm3 at Jul 25 14:36:57 ...
heartbeat: [9876]: EMERG: Rebooting system. Reason: /usr/lib64/heartbeat/crmd
I want a setup and operating procedure (an established practice) equivalent to
that.
Here is a patch that adds an option to choose whether to reboot the machine
when a child process fails.
https://github.com/inouekazu/pacemaker/commit/c1ac1048d8
What do you think?
Best regards.
It makes writing CTS tests hard, but it is not incorrect.
procedure:
$ systemctl start pacemaker
$ crm configure load update test.cli
$ pkill -9 lrmd
attachment:
STONITH.tar.bz2 : crm_report output from a run where the node was fenced
notSTONITH.tar.bz2 : crm_report output from a run where it was not fenced
Best regards.
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org