Re: [Pacemaker] Failed start of a resource after a Debian upgrading

Dejan Muhamedagic Tue, 24 Jan 2012 14:55:15 -0800

Hi,

On Tue, Jan 24, 2012 at 06:31:54PM +0100, Michal Vyoral wrote:
> Hello,
> we had a cluster of two nodes, both running Debian 5.0, with two resources,
> IPaddr2 and apache, managed by pacemaker 1.0.9.1. After upgrading
> one node from Debian 5.0 to 6.0, the apache resource fails to start
> on the upgraded node. Here are the details:
> 
> Versions of heartbeat and pacemaker before the upgrade:
> pr-iso1:~# dpkg -l pacemaker heartbeat
> Desired=Unknown/Install/Remove/Purge/Hold
> | Status=Not/Inst/Cfg-files/Unpacked/Failed-cfg/Half-inst/trig-aWait/Trig-pend
> |/ Err?=(none)/Hold/Reinst-required/X=both-problems (Status,Err: uppercase=bad)
> ||/ Name           Version        Description
> +++-==============-==============-============================================
> ii  heartbeat      1:3.0.3-2~bpo5 Subsystem for High-Availability Linux
> ii  pacemaker      1.0.9.1+hg1562 HA cluster resource manager
> 
> Versions of heartbeat and pacemaker after the upgrade:
> pr-iso2:~# dpkg -l pacemaker heartbeat
> Desired=Unknown/Install/Remove/Purge/Hold
> | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
> |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
> ||/ Name           Version        Description
> +++-==============-==============-============================================
> ii  heartbeat      1:3.0.3-2      Subsystem for High-Availability Linux
> ii  pacemaker      1.0.9.1+hg1562 HA cluster resource manager
> 
> Status of the resources on the upgraded node:
> pr-iso2:~# crm_mon
> ============
> Last updated: Tue Jan 24 10:14:12 2012
> Stack: Heartbeat
> Current DC: pr-iso2 (511079a9-0f71-4537-bdf9-07714b454441) - partition with quorum
> Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
> 2 Nodes configured, unknown expected votes
> 2 Resources configured.
> ============
> 
> Online: [ pr-iso2 ]
> OFFLINE: [ pr-iso1 ]
> 
> ClusterIP       (ocf::heartbeat:IPaddr2):       Started pr-iso2
> 
> Failed actions:
>     RTWeb_start_0 (node=pr-iso2, call=7, rc=1, status=complete): unknown error
> 
> Status of the resources on the non-upgraded node:
> pr-iso1:~# crm_mon
> ============
> Last updated: Tue Jan 24 17:08:22 2012
> Stack: Heartbeat
> Current DC: pr-iso1 (014268aa-f234-4789-b4a1-0053cf4e61b9) - partition with quorum
> Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
> 2 Nodes configured, unknown expected votes
> 2 Resources configured.
> ============
> 
> Online: [ pr-iso1 pr-iso2 ]
> 
> ClusterIP       (ocf::heartbeat:IPaddr2):       Started pr-iso1
> RTWeb   (ocf::heartbeat:apache):        Started pr-iso1
> 
> Configuration of the resources:
> pr-iso1:~# crm configure show
> node $id="014268aa-f234-4789-b4a1-0053cf4e61b9" pr-iso1
> node $id="511079a9-0f71-4537-bdf9-07714b454441" pr-iso2
> primitive ClusterIP ocf:heartbeat:IPaddr2 \
>         params ip="10.5.75.83" cidr_netmask="24" \
>         op monitor interval="30s"
> primitive RTWeb ocf:heartbeat:apache \
>         params configfile="/etc/apache2/apache2.conf" \
>         op monitor interval="1min" \
>         meta target-role="Started" is-managed="true"
> colocation website-with-ip inf: RTWeb ClusterIP
> order rtweb_after_clustrip inf: ClusterIP RTWeb
> property $id="cib-bootstrap-options" \
>         dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
>         cluster-infrastructure="Heartbeat" \
>         stonith-enabled="false" \
>         last-lrm-refresh="1327399494"
> rsc_defaults $id="rsc-options" \
>         resource-stickiness="100"
> 
> Records in the /var/log/ha-log related to RTWeb resource:
> pr-iso2:~# grep RTWeb /var/log/ha-log
> Jan 24 10:04:56 pr-iso2 crmd: [6130]: info: do_lrm_rsc_op: Performing 
> key=7:76:7:41cbad9d-9090-4aba-bd6a-bf171077c74b op=RTWeb_monitor_0 )
> Jan 24 10:04:56 pr-iso2 lrmd: [6127]: info: rsc:RTWeb:4: probe
> Jan 24 10:04:56 pr-iso2 crmd: [6130]: info: process_lrm_event: LRM operation 
> RTWeb_monitor_0 (call=4, rc=7, cib-update=13, confirmed=true) not running
> Jan 24 10:12:48 pr-iso2 crmd: [6130]: info: do_lrm_rsc_op: Performing 
> key=11:77:0:41cbad9d-9090-4aba-bd6a-bf171077c74b op=RTWeb_start_0 )
> Jan 24 10:12:48 pr-iso2 lrmd: [6127]: info: rsc:RTWeb:7: start


More output should follow this message (search for "apache"
or "lrmd"). The next release of the resource agents will log the
resource name too (RTWeb in this case). If you cannot find anything
there, then the answer must be in the apache logs.
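
A minimal sketch of that search, assuming the log is the same
/var/log/ha-log shown in the quoted output. The function name ra_lines
is made up for illustration; it only combines two plain grep calls to
show the lrmd and apache lines logged in the minute of the failed start.

```shell
#!/bin/sh
# ra_lines LOGFILE STAMP
# Print the lines of LOGFILE that contain STAMP (for example the
# minute of the failed start) and mention either lrmd or apache.
# ra_lines is a hypothetical helper name, not part of pacemaker.
ra_lines() {
    logfile=$1
    stamp=$2
    # -F treats the timestamp as a fixed string;
    # -E allows the lrmd|apache alternation.
    grep -F "$stamp" "$logfile" | grep -E 'lrmd|apache'
}
```

For the start at 10:12:48 it could be called as
ra_lines /var/log/ha-log "Jan 24 10:12". If that shows nothing useful,
the apache error log is the next place to look; on a default Debian
install that is probably /var/log/apache2/error.log, but the path
depends on the ErrorLog setting in your configuration.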

Thanks,

Dejan

> Jan 24 10:12:50 pr-iso2 crmd: [6130]: info: process_lrm_event: LRM operation 
> RTWeb_start_0 (call=7, rc=1, cib-update=16, confirmed=true) unknown error
> Jan 24 10:12:51 pr-iso2 attrd: [6129]: info: attrd_trigger_update: Sending 
> flush op to all hosts for: fail-count-RTWeb (INFINITY)
> Jan 24 10:12:51 pr-iso2 attrd: [6129]: info: attrd_perform_update: Sent 
> update 20: fail-count-RTWeb=INFINITY
> Jan 24 10:12:51 pr-iso2 attrd: [6129]: info: find_hash_entry: Creating hash 
> entry for last-failure-RTWeb
> Jan 24 10:12:51 pr-iso2 attrd: [6129]: info: attrd_trigger_update: Sending 
> flush op to all hosts for: last-failure-RTWeb (1327399969)
> Jan 24 10:12:51 pr-iso2 attrd: [6129]: info: attrd_perform_update: Sent 
> update 23: last-failure-RTWeb=1327399969
> Jan 24 10:13:08 pr-iso2 attrd: [6129]: info: attrd_trigger_update: Sending 
> flush op to all hosts for: fail-count-RTWeb (INFINITY)
> Jan 24 10:13:08 pr-iso2 attrd: [6129]: info: attrd_perform_update: Sent 
> update 27: fail-count-RTWeb=INFINITY
> Jan 24 10:13:08 pr-iso2 attrd: [6129]: info: attrd_trigger_update: Sending 
> flush op to all hosts for: last-failure-RTWeb (1327399969)
> Jan 24 10:13:08 pr-iso2 attrd: [6129]: info: attrd_perform_update: Sent 
> update 30: last-failure-RTWeb=1327399969
> Jan 24 10:13:08 pr-iso2 pengine: [9579]: WARN: unpack_rsc_op: Processing 
> failed op RTWeb_start_0 on pr-iso2: unknown error (1)
> Jan 24 10:13:08 pr-iso2 pengine: [9579]: notice: native_print: RTWeb    
> (ocf::heartbeat:apache):        Started pr-iso2 FAILED
> Jan 24 10:13:08 pr-iso2 pengine: [9579]: notice: RecurringOp:  Start 
> recurring monitor (60s) for RTWeb on pr-iso2
> Jan 24 10:13:08 pr-iso2 pengine: [9579]: notice: LogActions: Recover resource 
> RTWeb     (Started pr-iso2)
> Jan 24 10:13:08 pr-iso2 pengine: [9579]: WARN: unpack_rsc_op: Processing 
> failed op RTWeb_start_0 on pr-iso2: unknown error (1)
> Jan 24 10:13:08 pr-iso2 pengine: [9579]: notice: native_print: RTWeb    
> (ocf::heartbeat:apache):        Started pr-iso2 FAILED
> Jan 24 10:13:08 pr-iso2 pengine: [9579]: info: get_failcount: RTWeb has 
> failed INFINITY times on pr-iso2
> Jan 24 10:13:08 pr-iso2 pengine: [9579]: WARN: common_apply_stickiness: 
> Forcing RTWeb away from pr-iso2 after 1000000 failures (max=1000000)
> Jan 24 10:13:08 pr-iso2 pengine: [9579]: info: native_merge_weights: 
> ClusterIP: Rolling back scores from RTWeb
> Jan 24 10:13:08 pr-iso2 pengine: [9579]: info: native_color: Resource RTWeb 
> cannot run anywhere
> Jan 24 10:13:08 pr-iso2 pengine: [9579]: notice: LogActions: Stop resource 
> RTWeb        (pr-iso2)
> Jan 24 10:13:08 pr-iso2 crmd: [6130]: info: te_rsc_command: Initiating action 
> 1: stop RTWeb_stop_0 on pr-iso2 (local)
> Jan 24 10:13:08 pr-iso2 crmd: [6130]: info: do_lrm_rsc_op: Performing 
> key=1:1:0:1f379174-a1dd-413f-8108-466a22efdbd0 op=RTWeb_stop_0 )
> Jan 24 10:13:08 pr-iso2 lrmd: [6127]: info: rsc:RTWeb:8: stop
> Jan 24 10:13:09 pr-iso2 lrmd: [6127]: info: RA output: (RTWeb:stop:stderr) 
> kill: 830: No such process
> Jan 24 10:13:09 pr-iso2 crmd: [6130]: info: process_lrm_event: LRM operation 
> RTWeb_stop_0 (call=8, rc=0, cib-update=38, confirmed=true) ok
> Jan 24 10:13:09 pr-iso2 crmd: [6130]: info: match_graph_event: Action 
> RTWeb_stop_0 (1) confirmed on pr-iso2 (rc=0)
> Jan 24 10:14:51 pr-iso2 attrd: [6129]: WARN: attrd_cib_callback: Update 20 
> for fail-count-RTWeb=INFINITY failed: Remote node did not respond
> Jan 24 10:14:51 pr-iso2 attrd: [6129]: WARN: attrd_cib_callback: Update 23 
> for last-failure-RTWeb=1327399969 failed: Remote node did not respond
> Jan 24 10:15:01 pr-iso2 pengine: [9579]: WARN: unpack_rsc_op: Processing 
> failed op RTWeb_start_0 on pr-iso2: unknown error (1)
> Jan 24 10:15:01 pr-iso2 pengine: [9579]: notice: native_print: RTWeb    
> (ocf::heartbeat:apache):        Stopped
> Jan 24 10:15:01 pr-iso2 pengine: [9579]: info: get_failcount: RTWeb has 
> failed INFINITY times on pr-iso2
> Jan 24 10:15:01 pr-iso2 pengine: [9579]: WARN: common_apply_stickiness: 
> Forcing RTWeb away from pr-iso2 after 1000000 failures (max=1000000)
> Jan 24 10:15:01 pr-iso2 pengine: [9579]: info: native_merge_weights: 
> ClusterIP: Rolling back scores from RTWeb
> Jan 24 10:15:01 pr-iso2 pengine: [9579]: info: native_color: Resource RTWeb 
> cannot run anywhere
> Jan 24 10:15:01 pr-iso2 pengine: [9579]: notice: LogActions: Leave resource 
> RTWeb       (Stopped)
> Jan 24 10:26:22 pr-iso2 crmd: [9912]: info: do_lrm_rsc_op: Performing 
> key=8:2:7:2df45cf3-886f-4815-861e-079a2542e882 op=RTWeb_monitor_0 )
> Jan 24 10:26:22 pr-iso2 lrmd: [9909]: info: rsc:RTWeb:3: probe
> Jan 24 10:26:22 pr-iso2 crmd: [9912]: info: process_lrm_event: LRM operation 
> RTWeb_monitor_0 (call=3, rc=7, cib-update=8, confirmed=true) not running
> 
> How could we solve this problem? Thank you.
> 
> Best regards,
> Michal Vyoral
> 
> 
> _______________________________________________
> Pacemaker mailing list: [email protected]
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

