On Fri, Dec 14, 2012 at 1:33 AM, Latrous, Youssef
<ylatr...@broadviewnet.com> wrote:
>
> Andrew Beekhof <and...@beekhof.net> wrote:
>> 18014 is where we're up to now, 16048 is the (old) one that scheduled
>> the recurring monitor operation.
>> I suspect you'll find the action failed earlier in the logs and that's
>> why it needed to be restarted.
>>
>> Not the best log message though :(
>
> Thanks Andrew for the quick answer. I still need more info if possible.
>
> I searched everywhere for transition 16048 and I couldn't find a trace
> of it (looked for up to 5 days of logs prior to transition 18014).
> It would have been good if we had timestamps for each transition
> involved in this situation :-)
>
> Is there a way to find out about this old transition in any other logs (I
> looked into /var/log/messages on both nodes involved in this cluster)?

It's not really relevant.
The only important thing is that it's not one we're currently executing.

What you should care about is any logs that hopefully show you why the
resource failed around Dec  6 22:55:05.
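
If it helps, here is a self-contained sketch of that search. The sample log
lines are stand-ins copied from the excerpts quoted later in this thread; on
a live node you would point grep at /var/log/messages directly.

```shell
# Build a throwaway stand-in log so this sketch runs anywhere; on a real
# node, grep /var/log/messages (or your syslog target) instead.
log=$(mktemp)
cat > "$log" <<'EOF'
Dec  6 22:55:05 Node1 crmd: [5235]: WARN: update_failcount: Updating failcount for msgbroker on Node0 after failed monitor: rc=7
Dec  6 22:55:05 Node1 pengine: [5233]: WARN: unpack_rsc_op: Processing failed op msgbroker_monitor_10000 on Node0: not running (7)
EOF

# Count (and on a real node, read) every line mentioning the resource
# around the failure time.
matches=$(grep -c 'msgbroker' "$log")
echo "lines mentioning msgbroker: $matches"
rm -f "$log"
```

The resource-agent's own messages (if it logs any) around that timestamp are
usually the real answer to "why did the monitor return rc=7".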

>
> To give you an idea of how many transitions happened during this
> period:
>    TR_ID 18010 @ 21:52:16
>    ...
>    TR_ID 18018 @ 22:55:25
>
> Over an hour between these two events.
>
> Given this, how come such a (very) old transition (~2000 transitions
> before the current one) only acts now? Could it be stale information in
> pacemaker?

No. It hasn't only just acted now. It's been repeating over and over
for the last few weeks or so.
The difference is that this time it failed.
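
For what it's worth, the transition that scheduled an operation is carried in
that operation's `magic` string (visible in the abort_transition_graph line
quoted below). A small sketch of pulling the fields apart, assuming the
`op-status:rc;action:transition:target-rc:transition-uuid` layout that the
quoted log line appears to use:

```shell
# Magic string copied verbatim from the log excerpt in this thread.
magic='0:7;104:16048:0:5fb57f01-3397-45a8-905f-c48cecdc8692'

# Assumed layout: op-status:rc;action:transition:target-rc:transition-uuid
rc=$(echo "$magic" | cut -d';' -f1 | cut -d':' -f2)          # 7 = "not running"
transition=$(echo "$magic" | cut -d';' -f2 | cut -d':' -f2)  # transition that scheduled the op

echo "rc=$rc scheduled-by-transition=$transition"
```

So the "16048 vs. 18014" message is just crmd noticing that a result came in
for an operation scheduled long ago, which is exactly what a long-running
recurring monitor looks like.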

>
> Thanks in advance.
>
> Youssef
>
>
> Message: 4  from Pacemaker Digest, Vol 61, Issue 34
> ---------------------------------------------------------------
> Date: Thu, 13 Dec 2012 10:52:42 +1100
> From: Andrew Beekhof <and...@beekhof.net>
> To: The Pacemaker cluster resource manager
>         <pacemaker@oss.clusterlabs.org>
> Subject: Re: [Pacemaker] Action from a different CRMD transition
>         results in restarting services
> Message-ID:
>
> <CAEDLWG2LtrPuxTRrd=jbv1sxtilbg3sb0nu0feyf3yrgrnc...@mail.gmail.com>
> Content-Type: text/plain; charset=windows-1252
>
> On Thu, Dec 13, 2012 at 6:31 AM, Latrous, Youssef
> <ylatr...@broadviewnet.com> wrote:
>> Hi,
>>
>>
>>
>> I ran into the following issue and I couldn't find what it really
>> means:
>>
>>
>>
>>         Detected action msgbroker_monitor_10000 from a different
>> transition:
>> 16048 vs. 18014
>
> 18014 is where we're up to now, 16048 is the (old) one that scheduled
> the recurring monitor operation.
> I suspect you'll find the action failed earlier in the logs and that's
> why it needed to be restarted.
>
> Not the best log message though :(
>
>>
>>
>>
>> I can see that its impact is to stop/start a service but I'd like to
>> understand it a bit more.
>>
>>
>>
>> Thank you in advance for any information.
>>
>>
>>
>>
>>
>> Logs about this issue:
>>
>> …
>>
>> Dec  6 22:55:05 Node1 crmd: [5235]: info: process_graph_event:
>> Detected action msgbroker_monitor_10000 from a different transition:
>> 16048 vs. 18014
>>
>> Dec  6 22:55:05 Node1 crmd: [5235]: info: abort_transition_graph:
>> process_graph_event:477 - Triggered transition abort (complete=1,
>> tag=lrm_rsc_op, id=msgbroker_monitor_10000,
>> magic=0:7;104:16048:0:5fb57f01-3397-45a8-905f-c48cecdc8692,
>> cib=0.971.5) :
>> Old event
>>
>> Dec  6 22:55:05 Node1 crmd: [5235]: WARN: update_failcount: Updating
>> failcount for msgbroker on Node0 after failed monitor: rc=7
>> (update=value++,
>> time=1354852505)
>>
>> Dec  6 22:55:05 Node1 crmd: [5235]: info: do_state_transition: State
>> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
>> cause=C_FSA_INTERNAL origin=abort_transition_graph ]
>>
>> Dec  6 22:55:05 Node1 crmd: [5235]: info: do_state_transition: All 2
>> cluster nodes are eligible to run resources.
>>
>> Dec  6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke: Query 28069:
>> Requesting the current CIB: S_POLICY_ENGINE
>>
>> Dec  6 22:55:05 Node1 crmd: [5235]: info: abort_transition_graph:
>> te_update_diff:142 - Triggered transition abort (complete=1,
>> tag=nvpair, id=status-Node0-fail-count-msgbroker, magic=NA,
>> cib=0.971.6) : Transient
>> attribute: update
>>
>> Dec  6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke: Query 28070:
>> Requesting the current CIB: S_POLICY_ENGINE
>>
>> Dec  6 22:55:05 Node1 crmd: [5235]: info: abort_transition_graph:
>> te_update_diff:142 - Triggered transition abort (complete=1,
>> tag=nvpair, id=status-Node0-last-failure-msgbroker, magic=NA,
>> cib=0.971.7) : Transient
>> attribute: update
>>
>> Dec  6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke: Query 28071:
>> Requesting the current CIB: S_POLICY_ENGINE
>>
>> Dec  6 22:55:05 Node1 attrd: [5232]: info: find_hash_entry: Creating
>> hash entry for last-failure-msgbroker
>>
>> Dec  6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke_callback:
>> Invoking the PE: query=28071, ref=pe_calc-dc-1354852505-39407, seq=12,
>> quorate=1
>>
>> Dec  6 22:55:05 Node1 pengine: [5233]: notice: unpack_config: On loss
>> of CCM
>> Quorum: Ignore
>>
>> Dec  6 22:55:05 Node1 pengine: [5233]: notice: unpack_rsc_op:
>> Operation
>> txpublisher_monitor_0 found resource txpublisher active on Node1
>>
>> Dec  6 22:55:05 Node1 pengine: [5233]: WARN: unpack_rsc_op: Processing
>> failed op msgbroker_monitor_10000 on Node0: not running (7)
>>
>> …
>>
>> Dec  6 22:55:05 Node1 pengine: [5233]: notice: common_apply_stickiness:
>> msgbroker can fail 999999 more times on Node0 before being forced off
>>
>> …
>>
>> Dec  6 22:55:05 Node1 pengine: [5233]: notice: RecurringOp:  Start
>> recurring monitor (10s) for msgbroker on Node0
>>
>> …
>>
>> Dec  6 22:55:05 Node1 pengine: [5233]: notice: LogActions: Recover
>> msgbroker (Started Node0)
>>
>> …
>>
>> Dec  6 22:55:05 Node1 crmd: [5235]: info: te_rsc_command: Initiating
>> action
>> 37: stop msgbroker_stop_0 on Node0
>>
>>
>>
>>
>>
>> Transition 18014 details:
>>
>>
>>
>> Dec  6 22:52:18 Node1 pengine: [5233]: notice: process_pe_message:
>> Transition 18014: PEngine Input stored in:
>> /var/lib/pengine/pe-input-3270.bz2
>>
>> Dec  6 22:52:18 Node1 crmd: [5235]: info: do_state_transition: State
>> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
>> cause=C_IPC_MESSAGE origin=handle_response ]
>>
>> Dec  6 22:52:18 Node1 crmd: [5235]: info: unpack_graph: Unpacked
>> transition
>> 18014: 0 actions in 0 synapses
>>
>> Dec  6 22:52:18 Node1 crmd: [5235]: info: do_te_invoke: Processing
>> graph
>> 18014 (ref=pe_calc-dc-1354852338-39406) derived from
>> /var/lib/pengine/pe-input-3270.bz2
>>
>> Dec  6 22:52:18 Node1 crmd: [5235]: info: run_graph:
>> ====================================================
>>
>> Dec  6 22:52:18 Node1 crmd: [5235]: notice: run_graph: Transition
>> 18014 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
>> Source=/var/lib/pengine/pe-input-3270.bz2): Complete
>>
>> Dec  6 22:52:18 Node1 crmd: [5235]: info: te_graph_trigger: Transition
>> 18014 is now complete
>>
>> Dec  6 22:52:18 Node1 crmd: [5235]: info: notify_crmd: Transition
>> 18014
>> status: done - <null>
>>
>> Dec  6 22:52:18 Node1 crmd: [5235]: info: do_state_transition: State
>> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
>> cause=C_FSA_INTERNAL origin=notify_crmd ]
>>
>> Dec  6 22:52:18 Node1 crmd: [5235]: info: do_state_transition:
>> Starting PEngine Recheck Timer
>>
>>
>>
>>
>>
>> Youssef
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>
>
>
> ------------------------------
>
> Message: 5
> Date: Thu, 13 Dec 2012 01:17:17 +0000
> From: Xavier Lashmar <xlash...@uottawa.ca>
> To: The Pacemaker cluster resource manager
>         <pacemaker@oss.clusterlabs.org>
> Subject: Re: [Pacemaker] gfs2 / dlm on centos 6.2
> Message-ID:
>
> <cc445c0ceb8b8a4c87297d880d8f903bbcc0f...@cms-p04.uottawa.o.univ>
> Content-Type: text/plain; charset="windows-1252"
>
> I see, thanks very much for pointing me in the right direction!
>
> Xavier Lashmar
> Université d'Ottawa / University of Ottawa
> Analyste de Systèmes | Systems Analyst
> Service étudiants, service de l'informatique et des
> communications | Student services, computing and communications
> services.
> 1 Nicholas Street (810)
> Ottawa ON K1N 7B7
> Tél. | Tel. 613-562-5800 (2120)
> ________________________________
> From: Andrew Beekhof [and...@beekhof.net]
> Sent: Tuesday, December 11, 2012 9:30 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] gfs2 / dlm on centos 6.2
>
>
>
> On Wed, Dec 12, 2012 at 1:29 AM, Xavier Lashmar
> <xlash...@uottawa.ca<mailto:xlash...@uottawa.ca>> wrote:
> Hello,
>
> We are attempting to mount gfs2 partitions on CentOS using DRBD +
> COROSYNC + PACEMAKER.  Unfortunately we consistently get the following
> error:
>
> You'll need to configure pacemaker to use cman for this.
> See:
>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html/Clusters_from_Scratch/ch08s02.html
>
>
> # mount /dev/vg_data/lv_data /webdata/ -t gfs2 -v
> mount /dev/dm-2 /webdata
> parse_opts: opts = "rw"
>   clear flag 1 for "rw", flags = 0
> parse_opts: flags = 0
> parse_opts: extra = ""
> parse_opts: hostdata = ""
> parse_opts: lockproto = ""
> parse_opts: locktable = ""
> gfs_controld join connect error: Connection refused
> error mounting lockproto lock_dlm
>
> We are trying to find out where to get the lock_dlm libraries and
> packages for Centos 6.2 and 6.3
>
> Also, I found that the Fedora 17 version of the document "Pacemaker 1.1
> - Clusters from Scratch" is a bit problematic.  I'm also running a
> Fedora 17 system and found no package "dlm" as per the instructions in
> section 8.1.1:
>
> yum install -y gfs2-utils dlm kernel-modules-extra
>
> Any idea if an external repository is needed?  If so, which one, and
> which package do we need to install for CentOS 6+?
>
> Thanks very much
>
>
> Xavier Lashmar
> Analyste de Systèmes | Systems Analyst
> Service étudiants, service de l'informatique et des
> communications/Student services, computing and communications services.
> 1 Nicholas Street (810)
> Ottawa ON K1N 7B7
> Tél. | Tel. 613-562-5800 (2120)
>
>
>
>
>
>
>
>
>
> ------------------------------
>
>
>
> End of Pacemaker Digest, Vol 61, Issue 34
> *****************************************
>

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
