On 8 October 2010 09:29, Andrew Beekhof <and...@beekhof.net> wrote:
> On Fri, Oct 8, 2010 at 8:34 AM, Pavlos Parissis
> <pavlos.paris...@gmail.com> wrote:
>> On 8 October 2010 08:29, Andrew Beekhof <and...@beekhof.net> wrote:
>>> On Thu, Oct 7, 2010 at 9:58 PM, Pavlos Parissis
>>> <pavlos.paris...@gmail.com> wrote:
>>>>
>>>> On 7 October 2010 09:01, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>
>>>>> On Sat, Oct 2, 2010 at 6:31 PM, Pavlos Parissis
>>>>> <pavlos.paris...@gmail.com> wrote:
>>>>> > Hi,
>>>>> >
>>>>> > I am having the same issue again, in a different set of 3 nodes. When I
>>>>> > try to fail over the resource group manually onto the standby node, the
>>>>> > ms-drbd resource is not moved as well, and as a result the resource
>>>>> > group is not fully started; only the IP resource is started.
>>>>> > Any ideas why I am having this issue?
>>>>>
>>>>> I think it's a bug that was fixed recently. Could you try the latest
>>>>> code from Mercurial?
>>>>
>>>> 1.1 or 1.2 branch?
>>>
>>> 1.1
>>>
>> To save time compiling, I want to use the available 1.1.3 rpms from
>> the rpm-next repo.
>> But before I go and recreate the scenario, which means rebuilding 3
>> nodes, I would like to know whether this bug is fixed in 1.1.3.
>
> As I said, I believe so.
>
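[For reference: the exact commands behind the "manual failover" above were never posted, so the following is an assumption, using the crm shell's resource migration commands and the resource/node names from the configuration later in this mail:]

```
# Assumed invocation - move the group onto the standby node:
crm resource migrate pbx_service_01 node-03
# ...and, once done, remove the location constraint that migrate added:
crm resource unmigrate pbx_service_01
```

[Note that migrate only adds a location constraint for the group itself; the ms-drbd master is expected to follow via the colocation and order constraints, which is exactly what fails to happen in the reported bug.]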
I've just upgraded[1] my pacemaker to 1.1.3 and stonithd cannot be started. Am I missing something?

Oct 08 21:08:01 node-02 heartbeat: [14192]: info: Starting "/usr/lib/heartbeat/stonithd" as uid 0 gid 0 (pid 14192)
Oct 08 21:08:01 node-02 heartbeat: [14193]: info: Starting "/usr/lib/heartbeat/attrd" as uid 101 gid 103 (pid 14193)
Oct 08 21:08:01 node-02 heartbeat: [14194]: info: Starting "/usr/lib/heartbeat/crmd" as uid 101 gid 103 (pid 14194)
Oct 08 21:08:01 node-02 ccm: [14189]: info: Hostname: node-02
Oct 08 21:08:01 node-02 cib: [14190]: WARN: ccm_connect: CCM Activation failed
Oct 08 21:08:01 node-02 cib: [14190]: WARN: ccm_connect: CCM Connection failed 1 times (30 max)
Oct 08 21:08:01 node-02 attrd: [14193]: info: Invoked: /usr/lib/heartbeat/attrd
Oct 08 21:08:01 node-02 stonith-ng: [14192]: info: Invoked: /usr/lib/heartbeat/stonithd
Oct 08 21:08:01 node-02 stonith-ng: [14192]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Oct 08 21:08:01 node-02 heartbeat: [14158]: WARN: Client [stonith-ng] pid 14192 failed authorization [no default client auth]
Oct 08 21:08:01 node-02 heartbeat: [14158]: ERROR: api_process_registration_msg: cannot add client(stonith-ng)
Oct 08 21:08:01 node-02 stonith-ng: [14192]: ERROR: register_heartbeat_conn: Cannot sign on with heartbeat:
Oct 08 21:08:01 node-02 stonith-ng: [14192]: CRIT: main: Cannot sign in to the cluster... terminating
Oct 08 21:08:01 node-02 heartbeat: [14158]: WARN: Managed /usr/lib/heartbeat/stonithd process 14192 exited with return code 100.
Oct 08 21:08:01 node-02 crmd: [14194]: info: Invoked: /usr/lib/heartbeat/crmd
Oct 08 21:08:01 node-02 crmd: [14194]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Oct 08 21:08:02 node-02 crmd: [14194]: WARN: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
Oct 08 21:08:04 node-02 cib: [14190]: WARN: ccm_connect: CCM Activation failed
Oct 08 21:08:04 node-02 cib: [14190]: WARN: ccm_connect: CCM Connection failed 2 times (30 max)
Oct 08 21:08:05 node-02 crmd: [14194]: WARN: do_cib_control: Couldn't complete CIB registration 2 times... pause and retry
[..snip...]
Oct 08 21:08:33 node-02 crmd: [14194]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry

[1] I use CentOS 5.4, and when I did the installation I used the following repository:

[r...@node-02 ~]# cat /etc/yum.repos.d/pacemaker.repo
[clusterlabs]
name=High Availability/Clustering server technologies (epel-5)
baseurl=http://www.clusterlabs.org/rpm/epel-5
type=rpm-md
gpgcheck=0
enabled=1

and in order to perform the upgrade I added the following repository:

[clusterlabs-next]
name=High Availability/Clustering server technologies (epel-5-next)
baseurl=http://www.clusterlabs.org/rpm-next/epel-5
metadata_expire=45m
type=rpm-md
gpgcheck=0
enabled=1

and here is the installation/upgrade log, where you can see that only pacemaker-libs and pacemaker were upgraded.
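[Editorial aside, an assumption rather than something stated in the thread: since only pacemaker and pacemaker-libs were upgraded, heartbeat and cluster-glue remain at the builds that predate the stonith-ng client name, which could plausibly explain the "failed authorization [no default client auth]" message. Upgrading the companion packages in step would look roughly like:]

```
# Hypothetical remedy (untested suggestion): upgrade the whole stack
# together so heartbeat and cluster-glue match the pacemaker 1.1.3 build.
yum update pacemaker pacemaker-libs heartbeat heartbeat-libs \
    cluster-glue cluster-glue-libs resource-agents
```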
Oct 03 21:06:20 Installed: libibverbs-1.1.3-2.el5.i386
Oct 03 21:06:25 Installed: lm_sensors-2.10.7-9.el5.i386
Oct 03 21:06:31 Installed: 1:net-snmp-5.3.2.2-9.el5_5.1.i386
Oct 03 21:06:31 Installed: librdmacm-1.0.10-1.el5.i386
Oct 03 21:06:32 Installed: openhpi-libs-2.14.0-5.el5.i386
Oct 03 21:06:33 Installed: OpenIPMI-libs-2.0.16-7.el5.i386
Oct 03 21:06:35 Installed: libesmtp-1.0.4-5.el5.i386
Oct 03 21:06:36 Installed: cluster-glue-libs-1.0.6-1.6.el5.i386
Oct 03 21:06:37 Installed: heartbeat-libs-3.0.3-2.3.el5.i386
Oct 03 21:06:39 Installed: corosynclib-1.2.7-1.1.el5.i386
Oct 03 21:06:42 Installed: cluster-glue-1.0.6-1.6.el5.i386
Oct 03 21:06:45 Installed: resource-agents-1.0.3-2.6.el5.i386
Oct 03 21:06:46 Installed: heartbeat-3.0.3-2.3.el5.i386
Oct 03 21:06:47 Installed: pacemaker-libs-1.0.9.1-1.15.el5.i386
Oct 03 21:06:49 Installed: pacemaker-1.0.9.1-1.15.el5.i386
Oct 03 21:06:50 Installed: corosync-1.2.7-1.1.el5.i386
Oct 08 21:06:37 Updated: pacemaker-libs-1.1.3-1.el5.i386
Oct 08 21:06:43 Updated: pacemaker-1.1.3-1.el5.i386

and my conf:

[r...@node-02 log]# cibadmin -Ql | grep vali
<cib validate-with="pacemaker-1.0" crm_feature_set="3.0.2" have-quorum="1" dc-uuid="b7764e7b-0a00-4745-8d9e-6911271eefb2" admin_epoch="0" epoch="319" num_updates="60">

[r...@node-02 log]# crm configure show
node $id="80275014-5efe-4825-a29c-d42610f08cd1" node-02
node $id="b7764e7b-0a00-4745-8d9e-6911271eefb2" node-03
node $id="c7459ab3-55b6-4155-946d-5c1ba783507f" node-01
primitive drbd_01 ocf:linbit:drbd \
    params drbd_resource="drbd_pbx_service_1" \
    op monitor interval="30s" \
    op start interval="0" timeout="240s" \
    op stop interval="0" timeout="120s"
primitive drbd_02 ocf:linbit:drbd \
    params drbd_resource="drbd_pbx_service_2" \
    op monitor interval="30s" \
    op start interval="0" timeout="240s" \
    op stop interval="0" timeout="120s"
primitive fs_01 ocf:heartbeat:Filesystem \
    params device="/dev/drbd1" directory="/pbx_service_01" fstype="ext3" \
    meta migration-threshold="3" failure-timeout="60" \
    op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
    op start interval="0" timeout="60s" \
    op stop interval="0" timeout="60s"
primitive fs_02 ocf:heartbeat:Filesystem \
    params device="/dev/drbd2" directory="/pbx_service_02" fstype="ext3" \
    meta migration-threshold="3" failure-timeout="60" \
    op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
    op start interval="0" timeout="60s" \
    op stop interval="0" timeout="60s"
primitive ip_01 ocf:heartbeat:IPaddr2 \
    params ip="192.168.78.10" cidr_netmask="24" broadcast="192.168.78.255" \
    meta failure-timeout="120" migration-threshold="3" \
    op monitor interval="5s"
primitive ip_02 ocf:heartbeat:IPaddr2 \
    params ip="192.168.78.20" cidr_netmask="24" broadcast="192.168.78.255" \
    meta failure-timeout="120" migration-threshold="3" \
    op monitor interval="5s"
primitive pbx_01 lsb:znd-pbx_01 \
    meta failure-timeout="120" migration-threshold="3" target-role="Started" \
    op monitor interval="20s" timeout="40s" \
    op start interval="0" timeout="60s" \
    op stop interval="0" timeout="60s"
primitive pbx_02 ocf:heartbeat:Dummy \
    params state="/pbx_service_02/Dummy.state" \
    meta failure-timeout="120" migration-threshold="3" \
    op monitor interval="20s" timeout="40s"
primitive sshd-pbx_01 lsb:sshd-pbx_01 \
    meta target-role="Started" \
    op monitor interval="10m" \
    op start interval="0" timeout="60s" \
    op stop interval="0" timeout="60s"
primitive sshd-pbx_02 lsb:sshd-pbx_02 \
    meta target-role="Started" \
    op monitor interval="10m" \
    op start interval="0" timeout="60s" \
    op stop interval="0" timeout="60s"
primitive stonith-meatware stonith:meatware \
    params hostlist="node-01 node-02 node-03" stonith-timeout="60" \
    op start interval="0" timeout="60s" \
    op stop interval="0" timeout="60s"
group pbx_service_01 ip_01 fs_01 pbx_01 sshd-pbx_01 \
    meta target-role="Started"
group pbx_service_02 ip_02 fs_02 pbx_02 sshd-pbx_02 \
    meta target-role="Started"
ms ms-drbd_01 drbd_01 \
    meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
ms ms-drbd_02 drbd_02 \
    meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
clone stonith-clone stonith-meatware \
    meta clone-max="3" clone-node-max="1" target-role="Started" globally_unique="false"
location PrimaryNode-drbd_01 ms-drbd_01 100: node-01
location PrimaryNode-drbd_02 ms-drbd_02 100: node-02
location PrimaryNode-pbx_service_01 pbx_service_01 200: node-01
location PrimaryNode-pbx_service_02 pbx_service_02 200: node-02
location SecondaryNode-drbd_01 ms-drbd_01 0: node-03
location SecondaryNode-drbd_02 ms-drbd_02 0: node-03
location SecondaryNode-pbx_service_01 pbx_service_01 10: node-03
location SecondaryNode-pbx_service_02 pbx_service_02 10: node-03
location stonith-node-01 stonith-clone 100: node-01
location stonith-node-02 stonith-clone 100: node-02
location stonith-node-03 stonith-clone 100: node-03
colocation fs_01-on-drbd_01 inf: fs_01 ms-drbd_01:Master
colocation fs_02-on-drbd_02 inf: fs_02 ms-drbd_02:Master
order pbx_service_01-after-drbd_01 inf: ms-drbd_01:promote pbx_service_01:start
order pbx_service_02-after-drbd_02 inf: ms-drbd_02:promote pbx_service_02:start
property $id="cib-bootstrap-options" \
    stonith-enabled="true" \
    symmetric-cluster="false" \
    dc-version="1.1.3-9c2342c0378140df9bed7d192f2b9ed157908007" \
    cluster-infrastructure="Heartbeat" \
    last-lrm-refresh="1286195722"
rsc_defaults $id="rsc-options" \
    resource-stickiness="1000"
[r...@node-02 log]#

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker