Hello. Maybe my question is stupid, but are you root when you try to kill the procs?
Thanks
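As an aside for anyone reproducing this: the daemons run as root or as the cluster user (UID 496 in the listings below, typically hacluster), so a non-root shell cannot signal them. A quick sanity check, as a sketch using standard procps tools:

  $ id -u                                   # prints 0 when you are root
  $ ps -o user= -p $(pgrep -x pacemakerd)   # shows who owns pacemakerd (errors if it is not running)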
2013/1/9 Kazunori INOUE <inouek...@intellilink.co.jp>

> Hi Andrew,
>
> I have another question about this subject.
> Even if pengine, stonithd, and attrd crash after pacemakerd is killed
> (for example, killed by the OOM killer), the node status does not change.
>
> * pseudo testcase
>
> [dev1 ~]$ crm configure show
> node $id="2472913088" dev2
> node $id="2506467520" dev1
> primitive prmDummy ocf:pacemaker:Dummy \
>         op monitor on-fail="restart" interval="10s"
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.8-d20d06f" \
>         cluster-infrastructure="corosync" \
>         no-quorum-policy="ignore" \
>         stonith-enabled="false" \
>         startup-fencing="false"
> rsc_defaults $id="rsc-options" \
>         resource-stickiness="INFINITY" \
>         migration-threshold="1"
>
> [dev1 ~]$ pkill -9 pacemakerd
> [dev1 ~]$ pkill -9 pengine
> [dev1 ~]$ pkill -9 stonithd
> [dev1 ~]$ pkill -9 attrd
>
> [dev1 ~]$ ps -ef | egrep 'corosync|pacemaker'
> root   19124     1  0 14:27 ?  00:00:01 corosync
> 496    19144     1  0 14:27 ?  00:00:00 /usr/libexec/pacemaker/cib
> root   19146     1  0 14:27 ?  00:00:00 /usr/libexec/pacemaker/lrmd
> 496    19149     1  0 14:27 ?  00:00:00 /usr/libexec/pacemaker/crmd
>
> [dev1 ~]$ crm_mon -1
> :
> Stack: corosync
> Current DC: dev2 (2472913088) - partition with quorum
> Version: 1.1.8-d20d06f
> 2 Nodes configured, unknown expected votes
> 1 Resources configured.
>
> Online: [ dev1 dev2 ]
>
> prmDummy (ocf::pacemaker:Dummy): Started dev1
>
> The node (dev1) remains Online. When other processes such as lrmd
> crash, it becomes "UNCLEAN (offline)". Is this a bug, or is it the
> intended behavior?
>
> Best Regards,
> Kazunori INOUE
>
> (13.01.08 09:16), Andrew Beekhof wrote:
>> On Wed, Dec 19, 2012 at 8:15 PM, Kazunori INOUE
>> <inouek...@intellilink.co.jp> wrote:
>>> (12.12.13 08:26), Andrew Beekhof wrote:
>>>> On Wed, Dec 12, 2012 at 8:02 PM, Kazunori INOUE
>>>> <inouek...@intellilink.co.jp> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I understand that pacemakerd itself is much less likely to crash.
>>>>> However, the possibility of it being killed by the OOM killer etc.
>>>>> is not zero.
>>>>
>>>> True. Although we just established in another thread that we don't
>>>> have any leaks :)
>>>>
>>>>> So I think that users get confused, since the behavior at the time
>>>>> of a process death differs even while pacemakerd is running.
>>>>>
>>>>> case A)
>>>>> When pacemakerd and the other processes (crmd etc.) are in a
>>>>> parent-child relationship.
>>>>
>>>> [snip]
>>>>
>>>>> For example, crmd died. However, since it is relaunched, the state
>>>>> of the cluster is not affected.
>>>>
>>>> Right.
>>>>
>>>> [snip]
>>>>
>>>>> case B)
>>>>> When pacemakerd and the other processes are NOT in a parent-child
>>>>> relationship. Although pacemakerd was killed, assume it has been
>>>>> respawned by Upstart.
>>>>>
>>>>> $ service corosync start ; service pacemaker start
>>>>> $ pkill -9 pacemakerd
>>>>> $ ps -ef | egrep 'corosync|pacemaker|UID'
>>>>> UID      PID  PPID C STIME TTY  TIME     CMD
>>>>> root   21091     1 1 14:52 ?    00:00:00 corosync
>>>>> 496    21099     1 0 14:52 ?    00:00:00 /usr/libexec/pacemaker/cib
>>>>> root   21100     1 0 14:52 ?    00:00:00 /usr/libexec/pacemaker/stonithd
>>>>> root   21101     1 0 14:52 ?    00:00:00 /usr/libexec/pacemaker/lrmd
>>>>> 496    21102     1 0 14:52 ?    00:00:00 /usr/libexec/pacemaker/attrd
>>>>> 496    21103     1 0 14:52 ?    00:00:00 /usr/libexec/pacemaker/pengine
>>>>> 496    21104     1 0 14:52 ?    00:00:00 /usr/libexec/pacemaker/crmd
>>>>> root   21128     1 1 14:53 ?    00:00:00 /usr/sbin/pacemakerd
>>>>
>>>> Yep, looks right.
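The PPID column is what distinguishes the two cases. A quick way to inspect it for just the relevant daemons (a sketch, assuming procps ps and the daemon names shown above):

  $ ps -o pid,ppid,comm -C pacemakerd,cib,stonithd,lrmd,attrd,pengine,crmd
  # PPID equal to pacemakerd's PID -> case A (parent-child relationship)
  # PPID equal to 1                -> case B (re-parented to init)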
>>>
>>> Hi Andrew,
>>>
>>> We discussed this behavior. We reached the conclusion that there is
>>> room for improvement in the behavior when pacemakerd and the other
>>> processes are not in a parent-child relationship (case B).
>>>
>>> Since not all users are experts, they may kill pacemakerd
>>> accidentally. Such a user will get confused if the behavior after a
>>> crmd death changes with the following conditions:
>>> case A: pacemakerd and the others (crmd etc.) are in a parent-child
>>>         relationship.
>>> case B: pacemakerd and the others are not in a parent-child
>>>         relationship.
>>>
>>> So we want to *always* obtain the same behavior as the case where
>>> there is a parent-child relationship. That is, when crmd etc. die,
>>> we want pacemaker to always relaunch the process immediately.
>>
>> No. Sorry.
>> Writing features to satisfy an artificial test case is not good
>> practice.
>>
>> We can speed up the failure detection for case B (I'll agree that 60s
>> is way too long; 5s or 2s might be better depending on the load it
>> creates), but causing downtime now to _maybe_ avoid downtime in the
>> future makes no sense.
>> Especially when you consider that the node will likely be fenced if
>> the crmd fails anyway.
>>
>> Take a look at the logs from some ComponentFail test runs and you'll
>> see that the parent-child relationship regularly _fails_ to prevent
>> downtime.
>>
>>> Regards,
>>> Kazunori INOUE
>>>
>>>>> In this case, the node will be set to UNCLEAN if crmd dies.
>>>>> That is, the node will be fenced if there is a stonith resource.
>>>>
>>>> Which is exactly what happens if only pacemakerd is killed with your
>>>> proposal.
>>>> Except now you have time to do a graceful pacemaker restart to
>>>> re-establish the parent-child relationship.
>>>>
>>>> If you want to compare B with something, it needs to be with the old
>>>> "children terminate if pacemakerd dies" strategy. Which is:
>>>>
>>>>> $ service corosync start ; service pacemaker start
>>>>> $ pkill -9 pacemakerd
>>>>> ... the node will be set to UNCLEAN
>>>>
>>>> Old way: always downtime, because the children terminate, which
>>>> triggers fencing.
>>>> Our way: no downtime unless there is an additional failure (to the
>>>> cib or crmd).
>>>>
>>>> Given that we're trying for HA, the second seems preferable.
>>>>
>>>>> $ pkill -9 crmd
>>>>> $ crm_mon -1
>>>>> Last updated: Wed Dec 12 14:53:48 2012
>>>>> Last change: Wed Dec 12 14:53:10 2012 via crmd on dev2
>>>>>
>>>>> Stack: corosync
>>>>> Current DC: dev2 (2472913088) - partition with quorum
>>>>> Version: 1.1.8-3035414
>>>>>
>>>>> 2 Nodes configured, unknown expected votes
>>>>> 0 Resources configured.
>>>>>
>>>>> Node dev1 (2506467520): UNCLEAN (online)
>>>>> Online: [ dev2 ]
>>>>>
>>>>> How about making the behavior selectable with an option?
>>>>
>>>> MORE_DOWNTIME_PLEASE=(true|false) ?
>>>>
>>>>> When pacemakerd dies:
>>>>> mode A) behave in the existing way. (default)
>>>>> mode B) make the node UNCLEAN.
>>>>>
>>>>> Best Regards,
>>>>> Kazunori INOUE
>>>>>
>>>>>> Making stop work when there is no pacemakerd process is a
>>>>>> different matter. We can make that work.
>>>>>>
>>>>>>> Though the best solution is to relaunch pacemakerd, if that is
>>>>>>> difficult, I think a shortcut is to make the node unclean.
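For illustration, a minimal external relauncher in the spirit of Upstart's "respawn" stanza. This is a sketch only, not part of Pacemaker; it assumes pacemakerd stays in the foreground when started from a shell, so the loop backgrounds it itself:

  # watch for pacemakerd disappearing and start a fresh one
  while sleep 5; do
      if ! pgrep -x pacemakerd >/dev/null; then
          /usr/sbin/pacemakerd >/dev/null 2>&1 &
      fi
  done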
>>>>>>>
>>>>>>> And now, I tried Upstart a little bit.
>>>>>>>
>>>>>>> 1) Started corosync and pacemaker.
>>>>>>>
>>>>>>> $ cat /etc/init/pacemaker.conf
>>>>>>> respawn
>>>>>>> script
>>>>>>>     [ -f /etc/sysconfig/pacemaker ] && {
>>>>>>>         . /etc/sysconfig/pacemaker
>>>>>>>     }
>>>>>>>     exec /usr/sbin/pacemakerd
>>>>>>> end script
>>>>>>>
>>>>>>> $ service co start
>>>>>>> Starting Corosync Cluster Engine (corosync):        [  OK  ]
>>>>>>> $ initctl start pacemaker
>>>>>>> pacemaker start/running, process 4702
>>>>>>>
>>>>>>> $ ps -ef | egrep 'corosync|pacemaker'
>>>>>>> root   4695     1 0 17:21 ?  00:00:00 corosync
>>>>>>> root   4702     1 0 17:21 ?  00:00:00 /usr/sbin/pacemakerd
>>>>>>> 496    4703  4702 0 17:21 ?  00:00:00 /usr/libexec/pacemaker/cib
>>>>>>> root   4704  4702 0 17:21 ?  00:00:00 /usr/libexec/pacemaker/stonithd
>>>>>>> root   4705  4702 0 17:21 ?  00:00:00 /usr/libexec/pacemaker/lrmd
>>>>>>> 496    4706  4702 0 17:21 ?  00:00:00 /usr/libexec/pacemaker/attrd
>>>>>>> 496    4707  4702 0 17:21 ?  00:00:00 /usr/libexec/pacemaker/pengine
>>>>>>> 496    4708  4702 0 17:21 ?  00:00:00 /usr/libexec/pacemaker/crmd
>>>>>>>
>>>>>>> 2) Killed pacemakerd.
>>>>>>>
>>>>>>> $ pkill -9 pacemakerd
>>>>>>>
>>>>>>> $ ps -ef | egrep 'corosync|pacemaker'
>>>>>>> root   4695     1 0 17:21 ?  00:00:01 corosync
>>>>>>> 496    4703     1 0 17:21 ?  00:00:00 /usr/libexec/pacemaker/cib
>>>>>>> root   4704     1 0 17:21 ?  00:00:00 /usr/libexec/pacemaker/stonithd
>>>>>>> root   4705     1 0 17:21 ?  00:00:00 /usr/libexec/pacemaker/lrmd
>>>>>>> 496    4706     1 0 17:21 ?  00:00:00 /usr/libexec/pacemaker/attrd
>>>>>>> 496    4707     1 0 17:21 ?  00:00:00 /usr/libexec/pacemaker/pengine
>>>>>>> 496    4708     1 0 17:21 ?  00:00:00 /usr/libexec/pacemaker/crmd
>>>>>>> root   4760     1 1 17:24 ?  00:00:00 /usr/sbin/pacemakerd
>>>>>>>
>>>>>>> 3) Then I stopped pacemakerd; however, some processes did not stop.
>>>>>>>
>>>>>>> $ initctl stop pacemaker
>>>>>>> pacemaker stop/waiting
>>>>>>>
>>>>>>> $ ps -ef | egrep 'corosync|pacemaker'
>>>>>>> root   4695     1 0 17:21 ?  00:00:01 corosync
>>>>>>> 496    4703     1 0 17:21 ?  00:00:00 /usr/libexec/pacemaker/cib
>>>>>>> root   4704     1 0 17:21 ?  00:00:00 /usr/libexec/pacemaker/stonithd
>>>>>>> root   4705     1 0 17:21 ?  00:00:00 /usr/libexec/pacemaker/lrmd
>>>>>>> 496    4706     1 0 17:21 ?  00:00:00 /usr/libexec/pacemaker/attrd
>>>>>>> 496    4707     1 0 17:21 ?  00:00:00 /usr/libexec/pacemaker/pengine
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Kazunori INOUE
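One way to avoid the leftover daemons seen in step 3 might be a post-stop cleanup stanza in the same job file. This is a sketch only, not the stock job file; the daemon list is taken from the ps output above:

  post-stop script
      # if the respawned pacemakerd no longer tracked the old children,
      # reap any Pacemaker daemons that survived the stop
      for d in crmd pengine attrd lrmd stonithd cib; do
          pkill -x $d || true
      done
  end script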
>>>>>>>
>>>>>>>>>> This isn't the case when the plugin is in use, though. But
>>>>>>>>>> then I'd also have expected most of the processes to die too.
>>>>>>>>>
>>>>>>>>> Since the node status would also change if that happened, that
>>>>>>>>> is the behavior we would like.
>>>>>>>>>
>>>>>>>>>>> ----
>>>>>>>>>>> $ cat /etc/redhat-release
>>>>>>>>>>> Red Hat Enterprise Linux Server release 6.3 (Santiago)
>>>>>>>>>>>
>>>>>>>>>>> $ ./configure --sysconfdir=/etc --localstatedir=/var \
>>>>>>>>>>>     --without-cman --without-heartbeat
>>>>>>>>>>> -snip-
>>>>>>>>>>> pacemaker configuration:
>>>>>>>>>>>   Version  = 1.1.8 (Build: 9c13d14)
>>>>>>>>>>>   Features = generated-manpages agent-manpages ascii-docs
>>>>>>>>>>>     publican-docs ncurses libqb-logging libqb-ipc lha-fencing
>>>>>>>>>>>     corosync-native snmp
>>>>>>>>>>>
>>>>>>>>>>> $ cat config.log
>>>>>>>>>>> -snip-
>>>>>>>>>>> 6000 | #define BUILD_VERSION "9c13d14"
>>>>>>>>>>> 6001 | /* end confdefs.h.  */
>>>>>>>>>>> 6002 | #include <gio/gio.h>
>>>>>>>>>>> 6003 |
>>>>>>>>>>> 6004 | int
>>>>>>>>>>> 6005 | main ()
>>>>>>>>>>> 6006 | {
>>>>>>>>>>> 6007 | if (sizeof (GDBusProxy))
>>>>>>>>>>> 6008 |   return 0;
>>>>>>>>>>> 6009 | ;
>>>>>>>>>>> 6010 | return 0;
>>>>>>>>>>> 6011 | }
>>>>>>>>>>> 6012 configure:32411: result: no
>>>>>>>>>>> 6013 configure:32417: WARNING: Unable to support
>>>>>>>>>>>   systemd/upstart. You need to use glib >= 2.26
>>>>>>>>>>> -snip-
>>>>>>>>>>> 6286 | #define BUILD_VERSION "9c13d14"
>>>>>>>>>>> 6287 | #define SUPPORT_UPSTART 0
>>>>>>>>>>> 6288 | #define SUPPORT_SYSTEMD 0
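The glib requirement from that WARNING can be checked up front; a sketch assuming pkg-config and the glib-2.0 development files are installed:

  $ pkg-config --modversion glib-2.0
  $ pkg-config --atleast-version=2.26 glib-2.0 && echo "glib is new enough for systemd/upstart support"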
>>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> Kazunori INOUE
>>>>>>>>>>>
>>>>>>>>>>>>> related bugzilla:
>>>>>>>>>>>>> http://bugs.clusterlabs.org/show_bug.cgi?id=5064
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>> Kazunori INOUE

--
this is my life, and I live it for as long as God wills

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org