On Wed, Apr 13, 2011 at 12:23 AM, Stallmann, Andreas <[email protected]> wrote:
> Hi!
>
> We've got a pretty straightforward and easy configuration:
>
> Corosync 1.2.1 / Pacemaker 1.1.2 on OpenSuSE 11.3 running DRBD (M/S),
> ping (clone), and a resource group containing a shared IP, tomcat and
> mysql (where the data files of mysql reside on the DRBD). The cluster
> consists of two virtual machines running on VMware ESXi 4.
>
> Since we moved the cluster to another VMware ESXi host, strange things
> happen:
>
> While DRBD and the ping resource come up on both nodes, the resource
> group "appl_grp" (see below) doesn't. No failures are shown in crm_mon
> and the failcount is zero.

Is host_list="191.224.111.1 191.224.111.78 194.25.2.129" still valid?
If no value is being set for pingd then I can imagine this would be the
result.
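It is easy to verify whether the attribute is being set at all. A couple
of quick checks (exact option names vary a little between Pacemaker
versions, so treat these as a sketch):

    # transient attributes live in the status section of the CIB;
    # a pingd nvpair should exist for each node if the clone is working
    cibadmin -Q | grep pingd

    # newer crm_mon builds can show node attributes directly
    crm_mon -A -1

Also note that crm_mon hides stopped resources by default, which is why
appl_grp disappeared from the display entirely; "crm_mon -r" lists
inactive resources too.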
> Output of crm_mon:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ============
> Last updated: Tue Apr 12 23:39:39 2011
> Stack: openais
> Current DC: cms-appl02 - partition with quorum
> Version: 1.1.2-8b9ec9ccc5060457ac761dce1de719af86895b10
> 2 Nodes configured, 2 expected votes
> 3 Resources configured.
> ============
>
> Online: [ cms-appl01 cms-appl02 ]
>
> Master/Slave Set: ms_drbd_r0
>     Masters: [ cms-appl01 ]
>     Slaves: [ cms-appl02 ]
> Clone Set: pingy_clone
>     Started: [ cms-appl01 cms-appl02 ]
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Normally, I'd at least see the resource group as stopped, but now it
> doesn't even turn up in the crm_mon display!
>
> The crm tool at least shows that the resources still exist:
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> crm(live)# resource
> crm(live)resource# show
>  Resource Group: appl_grp
>      fs_r0        (ocf::heartbeat:Filesystem) Stopped
>      sharedIP     (ocf::heartbeat:IPaddr2) Stopped
>      tomcat_res   (ocf::heartbeat:tomcat) Stopped
>      database_res (ocf::heartbeat:mysql) Stopped
>  Master/Slave Set: ms_drbd_r0
>      Masters: [ cms-appl01 ]
>      Slaves: [ cms-appl02 ]
>  Clone Set: pingy_clone
>      Started: [ cms-appl01 cms-appl02 ]
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> And finally, here's our configuration:
>
> ~~~~~~~~~~~~~~ output of "crm configure show" ~~~~~~~~
> node cms-appl01
> node cms-appl02
> primitive database_res ocf:heartbeat:mysql \
>         params binary="/usr/bin/mysqld_safe" config="/etc/my.cnf" \
>                 datadir="/drbd/mysql" user="mysql" \
>                 log="/var/log/mysql/mysqld.log" pid="/var/run/mysql/mysqld.pid" \
>                 socket="/drbd/run/mysql/mysql.sock" \
>         op start interval="0" timeout="120s" \
>         op stop interval="0" timeout="120s" \
>         op monitor interval="10s" timeout="30s" \
>         op notify interval="0" timeout="90s"
> primitive drbd_r0 ocf:linbit:drbd \
>         params drbd_resource="r0" \
>         op monitor interval="15s" \
>         op start interval="0" timeout="240s" \
>         op stop interval="0" timeout="100s"
> primitive fs_r0 ocf:heartbeat:Filesystem \
>         params device="/dev/drbd0" directory="/drbd" fstype="ext4" \
>         op start interval="0" timeout="60s" \
>         op stop interval="0" timeout="60s"
> primitive pingy_res ocf:pacemaker:ping \
>         params dampen="5s" multiplier="1000" \
>                 host_list="191.224.111.1 191.224.111.78 194.25.2.129" \
>         op monitor interval="60s" timeout="60s" \
>         op start interval="0" timeout="60s"
> primitive sharedIP ocf:heartbeat:IPaddr2 \
>         params ip="191.224.111.50" cidr_netmask="255.255.255.0" nic="eth0:0"
> primitive tomcat_res ocf:heartbeat:tomcat \
>         params java_home="/etc/alternatives/jre" \
>         params catalina_home="/usr/share/tomcat6" \
>         op start interval="0" timeout="60s" \
>         op stop interval="0" timeout="120s" \
>         op monitor interval="10s" timeout="30s"
> group appl_grp fs_r0 sharedIP tomcat_res database_res \
>         meta target-role="Started"
> ms ms_drbd_r0 drbd_r0 \
>         meta master-max="1" master-node-max="1" clone-max="2" \
>                 clone-node-max="1" notify="true"
> clone pingy_clone pingy_res
> location appl_loc appl_grp 100: cms-appl01
> location only-if-connected appl_grp \
>         rule $id="only-if-connected-rule" -inf: not_defined pingd or pingd lte 2000
> colocation appl_grp-only-on-master inf: appl_grp ms_drbd_r0:Master
> order appl_grp-after-drbd inf: ms_drbd_r0:promote appl_grp:start
> order mysql-after-fs inf: fs_r0 database_res
> property $id="cib-bootstrap-options" \
>         stonith-enabled="false" \
>         no-quorum-policy="ignore" \
>         stonith-action="poweroff" \
>         default-resource-stickiness="100" \
>         dc-version="1.1.2-8b9ec9ccc5060457ac761dce1de719af86895b10" \
>         cluster-infrastructure="openais" \
>         expected-quorum-votes="2" \
>         last-lrm-refresh="1302643565"
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
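A side note on that constraint: with multiplier="1000" and three
addresses in host_list, pingd can only ever be 0, 1000, 2000 or 3000, so
"pingd lte 2000" bans the group from any node that cannot reach all
three targets. If one of those addresses stopped answering from the new
ESXi host, the group has nowhere left to run, with no failure and no
failcount, exactly as you describe. A less strict variant (just a
sketch; pick the threshold you actually want) would require only one
reachable target per node:

    location only-if-connected appl_grp \
            rule $id="only-if-connected-rule" -inf: not_defined pingd or pingd lte 0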
$id="cib-bootstrap-options" \ > stonith-enabled="false" \ > no-quorum-policy="ignore" \ > stonith-action="poweroff" \ > default-resource-stickiness="100" \ > dc-version="1.1.2-8b9ec9ccc5060457ac761dce1de719af86895b10" \ > cluster-infrastructure="openais" \ > expected-quorum-votes="2" \ > last-lrm-refresh="1302643565" > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > When I (re)activate the appl_grp, literarily nothing happens: > > crm(live)resource# start nag_grp > > No new entries in /var/log/messages, no visible changes in crm_mon. It is as > if the resource didn't exist. > > Any ideas? You'll find the logs below. > > Cheers and good night, > > Andreas > > I found only one error message in /var/log/messages: > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > Apr 12 23:56:11 cms-appl01 cib: [3888]: info: retrieveCib: Reading cluster > configuration from: /var/lib/heartbeat/crm/cib.dtzl4N (digest: > /var/lib/heartbeat/crm/cib.QPtzfE) > Apr 12 23:56:11 cms-appl01 pengine: [2662]: info: process_pe_message: > Transition 0: PEngine Input stored in: /var/lib/pengine/pe-input-2971.bz2 > Apr 12 23:56:11 cms-appl01 crmd: [2663]: info: match_graph_event: Action > tomcat_res_monitor_0 (13) confirmed on cms-appl02 (rc=0) > Apr 12 23:56:11 cms-appl01 crmd: [2663]: info: match_graph_event: Action > database_res_monitor_0 (14) confirmed on cms-appl02 (rc=0) > Apr 12 23:56:11 cms-appl01 crmd: [2663]: info: process_lrm_event: LRM > operation tomcat_res_monitor_0 (call=4, rc=7, cib-update=31, confirmed=true) > not running > Apr 12 23:56:11 cms-appl01 crmd: [2663]: info: match_graph_event: Action > tomcat_res_monitor_0 (6) confirmed on cms-appl01 (rc=0) > Apr 12 23:56:11 cms-appl01 crmd: [2663]: info: match_graph_event: Action > fs_r0_monitor_0 (11) confirmed on cms-appl02 (rc=0) > Apr 12 23:56:11 cms-appl01 crmd: [2663]: info: match_graph_event: Action > sharedIP_monitor_0 (12) confirmed on cms-appl02 (rc=0) > Apr 12 23:56:11 cms-appl01 Filesystem[3889]: [3917]: WARNING: Couldn't find > device [/dev/drbd0]. Expected /dev/??? to exist > Apr 12 23:56:11 cms-appl01 mysql[3892]: [3932]: ERROR: Datadir /drbd/mysql > doesn't exist > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > There are quite a lot of warnings: > > Apr 12 23:54:32 cms-appl01 cib: [2651]: WARN: send_ipc_message: IPC Channel > to 2655 is not connected > Apr 12 23:54:32 cms-appl01 cib: [2651]: WARN: send_via_callback_channel: > Delivery of reply to client 2655/d4c6501f-32cb-49a4-a800-17d5385d71cb failed > Apr 12 23:54:32 cms-appl01 cib: [2651]: WARN: do_local_notify: A-Sync reply > to crmd failed: reply failed > Apr 12 23:55:04 cms-appl01 rchal: boot with 'CPUFREQ=no' in to avoid this > warning. > Apr 12 23:55:15 cms-appl01 logd: [2326]: WARN: Core dumps could be lost if > multiple dumps occur. 
>
> There are quite a lot of warnings:
>
> Apr 12 23:54:32 cms-appl01 cib: [2651]: WARN: send_ipc_message: IPC Channel
> to 2655 is not connected
> Apr 12 23:54:32 cms-appl01 cib: [2651]: WARN: send_via_callback_channel:
> Delivery of reply to client 2655/d4c6501f-32cb-49a4-a800-17d5385d71cb failed
> Apr 12 23:54:32 cms-appl01 cib: [2651]: WARN: do_local_notify: A-Sync reply
> to crmd failed: reply failed
> Apr 12 23:55:04 cms-appl01 rchal: boot with 'CPUFREQ=no' in to avoid this
> warning.
> Apr 12 23:55:15 cms-appl01 logd: [2326]: WARN: Core dumps could be lost if
> multiple dumps occur.
> Apr 12 23:55:15 cms-appl01 logd: [2326]: WARN: Consider setting non-default
> value in /proc/sys/kernel/core_pattern (or equivalent) for maximum
> supportability
> Apr 12 23:55:15 cms-appl01 logd: [2326]: WARN: Consider setting
> /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
> supportability
> Apr 12 23:55:18 cms-appl01 corosync[2650]: [pcmk ] WARN: route_ais_message:
> Sending message to local.crmd failed: ipc delivery failed (rc=-2)
> Apr 12 23:55:18 cms-appl01 corosync[2650]: [pcmk ] WARN: route_ais_message:
> Sending message to local.crmd failed: ipc delivery failed (rc=-2)
> Apr 12 23:55:18 cms-appl01 mgmtd: [2664]: WARN: Core dumps could be lost if
> multiple dumps occur.
> Apr 12 23:55:18 cms-appl01 mgmtd: [2664]: WARN: Consider setting non-default
> value in /proc/sys/kernel/core_pattern (or equivalent) for maximum
> supportability
> Apr 12 23:55:18 cms-appl01 mgmtd: [2664]: WARN: Consider setting
> /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
> supportability
> Apr 12 23:55:18 cms-appl01 mgmtd: [2664]: WARN: lrm_signon: can not initiate
> connection
> Apr 12 23:55:18 cms-appl01 lrmd: [2660]: WARN: Core dumps could be lost if
> multiple dumps occur.
> Apr 12 23:55:18 cms-appl01 lrmd: [2660]: WARN: Consider setting non-default
> value in /proc/sys/kernel/core_pattern (or equivalent) for maximum
> supportability
> Apr 12 23:55:18 cms-appl01 lrmd: [2660]: WARN: Consider setting
> /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
> supportability
> Apr 12 23:56:11 cms-appl01 crmd: [2663]: WARN:
> cib_client_add_notify_callback: Callback already present
> Apr 12 23:56:11 cms-appl01 Filesystem[3889]: [3917]: WARNING: Couldn't find
> device [/dev/drbd0]. Expected /dev/??? to exist
> Apr 12 23:56:11 cms-appl01 lrmd: [2660]: info: RA output:
> (sharedIP:probe:stderr) Converted dotted-quad netmask to CIDR as:
> 24#012eth0:0: warning: name may be invalid
>
> --
> CONET Solutions GmbH
> Andreas Stallmann
> Theodor-Heuss-Allee 19, 53773 Hennef
> Tel.: +49 2242 939-677, Fax: +49 2242 939-393
> Mobil: +49 172 2455051
> Internet: http://www.conet.de, mailto:[email protected]
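If the pingd theory doesn't pan out, the policy engine can tell you
directly why it places nothing. Something like this (ptest ships with
Pacemaker of that vintage; newer builds call the equivalent tool
crm_simulate):

    # show the allocation scores computed for the live cluster;
    # look for -INFINITY against fs_r0/appl_grp on both nodes
    ptest -sL

A -INFINITY score on every node for fs_r0 would confirm that the
only-if-connected rule (or the colocation with the DRBD master) is what
is pinning the group down.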
