On Fri, Nov 25, 2011 at 8:44 AM, Attila Megyeri <amegy...@minerva-soft.com> wrote:
> Hi Gents,
>
> I see from time to time that you are asking for "cibadmin -Ql" type outputs
> to help people troubleshoot their problems.
>
> Currently I have an issue promoting an MS resource (the PSQL issue in the
> previous mail), and I would like to start troubleshooting the problem, but
> did not find any how-tos or documentation on this topic.
> Could you provide any details on how to troubleshoot CIB states?
Start with crm_mon -o. Then check what crm_simulate -L says. Try adding
additional -V arguments and grepping for your resource name.

> My current issue is that I have an MS resource that is started in slave/slave
> mode, and the "promote" is never even called by the CIB. I'd like to start
> the research but have no idea how to do it.

Are you sure the promote doesn't happen? No mention of it in the logs?

>
> I have read the Pacemaker doc, as well as the Clusters from Scratch doc, but
> there are no troubleshooting hints.
>
> Thank you in advance,
>
> Attila
>
> -----Original Message-----
> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> Sent: 2011. november 23. 16:53
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed
>
> Hi Takatoshi, All,
>
> Thanks for your reply.
> I see that you have invested significant effort in the development of the RA.
> I spent the last day trying to set up the RA, but without much success.
>
> My infrastructure is very similar to yours, except for the fact that
> currently I am testing with a single network adapter.
>
> Replication works nicely when I start the databases manually, not using
> corosync.
>
> When I try to start using corosync, I see that the ping resources start
> normally, but the msPostgresql resource starts on both nodes in slave mode,
> and I see "HS:alone".
>
> In the wiki you state that if I start on a single node only, PSQL should
> start in Master mode (PRI), but this is not the case.
>
> The recovery.conf file is created immediately, and from the logs I see no
> attempt at all to promote the node.
> In the postgres logs I see that node1, which is supposed to be a master,
> tries to connect to the vip-rep IP address, which is NOT brought up, because
> it depends on the Master role...
>
> Do you have any idea?
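For example, something along these lines can be a rough starting point (the
resource name is taken from the config quoted below; the --show-scores flag
and the syslog location are assumptions and may differ on your build):

  # one-shot cluster status, including the operation history
  crm_mon -1 -o

  # what the policy engine would do right now; each extra -V adds detail,
  # -s (--show-scores) prints the allocation/promotion scores
  crm_simulate -L -s -VVV 2>&1 | grep -i postgresql

  # was a promote ever scheduled or executed? (Debian normally sends
  # corosync/pacemaker messages to syslog)
  grep -i promote /var/log/syslog

  # the node attributes the pgsql RA maintains (PRI / HS:sync / HS:alone)
  cibadmin -Ql | grep -i pgsql-status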
>
> My environment:
> Debian Squeeze, with backported Pacemaker (version 1.1.5) - the official
> Pacemaker in Debian is rather old and buggy.
> Postgres 9.1, streaming replication, sync mode
> Node1: psql1, 10.12.1.21
> Node2: psql2, 10.12.1.22
>
> Crm config:
>
> node psql1 \
>     attributes standby="off"
> node psql2 \
>     attributes standby="off"
> primitive pingCheck ocf:pacemaker:ping \
>     params name="default_ping_set" host_list="10.12.1.1" multiplier="100" \
>     op start interval="0s" timeout="60s" on-fail="restart" \
>     op monitor interval="10s" timeout="60s" on-fail="restart" \
>     op stop interval="0s" timeout="60s" on-fail="ignore"
> primitive postgresql ocf:heartbeat:pgsql \
>     params pgctl="/usr/lib/postgresql/9.1/bin/pg_ctl" psql="/usr/bin/psql" \
>         pgdata="/var/lib/postgresql/9.1/main" \
>         config="/etc/postgresql/9.1/main/postgresql.conf" \
>         pgctldata="/usr/lib/postgresql/9.1/bin/pg_controldata" \
>         rep_mode="sync" node_list="psql1 psql2" \
>         restore_command="cp /var/lib/postgresql/9.1/main/pg_archive/%f %p" \
>         master_ip="10.12.1.28" \
>     op start interval="0s" timeout="60s" on-fail="restart" \
>     op monitor interval="7s" timeout="60s" on-fail="restart" \
>     op monitor interval="2s" role="Master" timeout="60s" on-fail="restart" \
>     op promote interval="0s" timeout="60s" on-fail="restart" \
>     op demote interval="0s" timeout="60s" on-fail="block" \
>     op stop interval="0s" timeout="60s" on-fail="block" \
>     op notify interval="0s" timeout="60s"
> primitive vip-master ocf:heartbeat:IPaddr2 \
>     params ip="10.12.1.20" nic="eth0" cidr_netmask="24" \
>     op start interval="0s" timeout="60s" on-fail="restart" \
>     op monitor interval="10s" timeout="60s" on-fail="restart" \
>     op stop interval="0s" timeout="60s" on-fail="block" \
>     meta target-role="Started"
> primitive vip-rep ocf:heartbeat:IPaddr2 \
>     params ip="10.12.1.28" nic="eth0" cidr_netmask="24" \
>     op start interval="0s" timeout="60s" on-fail="restart" \
>     op monitor interval="10s" timeout="60s" on-fail="restart" \
>     op stop interval="0s" timeout="60s" on-fail="block" \
>     meta target-role="Started"
> primitive vip-slave ocf:heartbeat:IPaddr2 \
>     params ip="10.12.1.27" nic="eth0" cidr_netmask="24" \
>     meta resource-stickiness="1" \
>     op start interval="0s" timeout="60s" on-fail="restart" \
>     op monitor interval="10s" timeout="60s" on-fail="restart" \
>     op stop interval="0s" timeout="60s" on-fail="block"
> group master-group vip-master vip-rep
> ms msPostgresql postgresql \
>     meta master-max="1" master-node-max="1" clone-max="2" \
>         clone-node-max="1" notify="true" target-role="Master"
> clone clnPingCheck pingCheck
> location rsc_location-1 vip-slave \
>     rule $id="rsc_location-1-rule" 200: pgsql-status eq HS:sync \
>     rule $id="rsc_location-1-rule-0" 100: pgsql-status eq PRI \
>     rule $id="rsc_location-1-rule-1" -inf: not_defined pgsql-status \
>     rule $id="rsc_location-1-rule-2" -inf: pgsql-status ne HS:sync and pgsql-status ne PRI
> location rsc_location-2 msPostgresql \
>     rule $id="rsc_location-2-rule" $role="master" 200: #uname eq psql1 \
>     rule $id="rsc_location-2-rule-0" $role="master" 100: #uname eq psql2 \
>     rule $id="rsc_location-2-rule-1" $role="master" -inf: defined fail-count-vip-master \
>     rule $id="rsc_location-2-rule-2" $role="master" -inf: defined fail-count-vip-rep \
>     rule $id="rsc_location-2-rule-3" -inf: not_defined default_ping_set or default_ping_set lt 100
> colocation rsc_colocation-1 inf: msPostgresql clnPingCheck
> colocation rsc_colocation-2 inf: master-group msPostgresql:Master
> order rsc_order-1 0: clnPingCheck msPostgresql
> order rsc_order-2 0: msPostgresql:promote master-group:start symmetrical=false
> order rsc_order-3 0: msPostgresql:demote master-group:stop symmetrical=false
> property $id="cib-bootstrap-options" \
>     dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
>     cluster-infrastructure="openais" \
>     expected-quorum-votes="2" \
>     stonith-enabled="false" \
>     no-quorum-policy="ignore"
> rsc_defaults $id="rsc-options" \
>     resource-stickiness="INFINITY" \
>     migration-threshold="1"
>
> Regards,
> Attila
>
> -----Original Message-----
> From: Takatoshi MATSUO [mailto:matsuo....@gmail.com]
> Sent: 2011. november 17. 8:04
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed
>
> Hi All,
>
> I created an RA for PostgreSQL 9.1 streaming replication based on pgsql.
>
> RA
> https://github.com/t-matsuo/resource-agents/blob/pgsql91/heartbeat/pgsql
> Documents
> https://github.com/t-matsuo/resource-agents/wiki
>
> It is almost totally changed from the previous patch
> (http://lists.linux-ha.org/pipermail/linux-ha-dev/2011-February/018193.html).
> It creates recovery.conf and promotes PostgreSQL automatically.
> Additionally, it can switch between synchronous and asynchronous
> replication automatically.
>
> Please try it and comment.
>
> Regards,
> Takatoshi MATSUO
>
> 2011/11/17 Serge Dubrouski <serge...@gmail.com>:
>>
>> On Wed, Nov 16, 2011 at 12:55 PM, Attila Megyeri
>> <amegy...@minerva-soft.com> wrote:
>>>
>>> Hi Florian,
>>>
>>> -----Original Message-----
>>> From: Florian Haas [mailto:flor...@hastexo.com]
>>> Sent: 2011. november 16. 11:49
>>> To: The Pacemaker cluster resource manager
>>> Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed
>>>
>>> Hi Attila,
>>>
>>> On 2011-11-16 10:27, Attila Megyeri wrote:
>>> > Hi All,
>>> >
>>> > We have a two-node PostgreSQL 9.1 system configured using streaming
>>> > replication (active/active with a read-only slave).
>>> >
>>> > We want to automate the failover process and I couldn't really find
>>> > a resource agent that could do the job.
>>>
>>> That is correct; the pgsql resource agent (unlike its mysql
>>> counterpart) does not support streaming replication. We've had a
>>> contributor submit a patch at one point, but it was somewhat
>>> ill-conceived and thus did not make it into the upstream repo. The
>>> relevant thread is here:
>>>
>>> http://lists.linux-ha.org/pipermail/linux-ha-dev/2011-February/018195.html
>>>
>>> Would you feel comfortable modifying the pgsql resource agent to
>>> support replication? If so, we could revisit this issue and
>>> potentially add streaming replication support to pgsql.
>>>
>>>
>>> Well, I'm not sure I would be able to do that change. Failover is
>>> relatively easy to do, but I really have no idea how to do the failback part.
>>
>> And that's exactly the reason why I haven't implemented it yet. With
>> the way replication is currently done in PostgreSQL there is no easy
>> way to switch roles, or at least I don't know of such a way.
>> Implementing just fail-over functionality, by creating a trigger file
>> on the slave server in the case of a failure on the master side, doesn't
>> amount to a full master-slave implementation in my opinion.
>>
>>>
>>> I will definitely have to sort this out somehow; I am just unsure
>>> whether I will try to use the repmgr mentioned in the video, or
>>> Pacemaker with some level of customization...
>>>
>>> Is the resource agent that you mentioned available somewhere?
>>>
>>> Thanks.
>>> Attila
>>
>> --
>> Serge Dubrouski.

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
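For reference on the trigger-file approach and the recovery.conf handling
discussed in the quoted thread: with plain PostgreSQL 9.1 streaming replication
(no resource agent involved), the standby runs from a recovery.conf roughly
like the sketch below, and a manual failover means creating the trigger file on
the standby. The host and restore_command values are taken from the
configuration quoted above; the port, user, application_name and trigger-file
path are placeholders, and the file the pgsql91 RA writes for itself may differ.

  # recovery.conf on the standby (PostgreSQL 9.1) - hand-written sketch
  standby_mode = 'on'
  # application_name is what synchronous_standby_names on the master
  # references when the replication runs in synchronous mode
  primary_conninfo = 'host=10.12.1.28 port=5432 user=postgres application_name=psql2'
  # fall back to archived WAL segments, as in the pgsql resource parameters
  restore_command = 'cp /var/lib/postgresql/9.1/main/pg_archive/%f %p'
  # creating this file (e.g. with touch) promotes the standby to a primary
  trigger_file = '/var/lib/postgresql/9.1/main/trigger'

The pgsql91 RA takes care of generating recovery.conf and driving the promotion
through Pacemaker, which is why the configuration above orders master-group
(and with it vip-rep) after msPostgresql:promote.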