On 12 Nov 2013, at 4:42 pm, Andrey Groshev <gre...@yandex.ru> wrote:

> 
> 
> 11.11.2013, 03:44, "Andrew Beekhof" <and...@beekhof.net>:
>> On 8 Nov 2013, at 7:49 am, Andrey Groshev <gre...@yandex.ru> wrote:
>> 
>>>  Hi, PPL!
>>>  I need help. I do not understand why it has stopped working.
>>>  This configuration works on another cluster, but with corosync 1.
>>> 
>>>  So: a PostgreSQL cluster with master/slave.
>>>  Classic config, as in the wiki.
>>>  I build the cluster, start it, and it works.
>>>  Next I kill postgres on the master with signal 6 (SIGABRT), as if disk space had run out:
>>> 
>>>  # pkill -6 postgres
>>>  # ps axuww|grep postgres
>>>  root      9032  0.0  0.1 103236   860 pts/0    S+   00:37   0:00 grep 
>>> postgres
>>> 
>>>  PostgreSQL dies, but crm_mon shows that the master is still running.
>>> 
>>>  Last updated: Fri Nov  8 00:42:08 2013
>>>  Last change: Fri Nov  8 00:37:05 2013 via crm_attribute on 
>>> dev-cluster2-node4
>>>  Stack: corosync
>>>  Current DC: dev-cluster2-node4 (172793107) - partition with quorum
>>>  Version: 1.1.10-1.el6-368c726
>>>  3 Nodes configured
>>>  7 Resources configured
>>> 
>>>  Node dev-cluster2-node2 (172793105): online
>>>         pingCheck       (ocf::pacemaker:ping):  Started
>>>         pgsql   (ocf::heartbeat:pgsql): Started
>>>  Node dev-cluster2-node3 (172793106): online
>>>         pingCheck       (ocf::pacemaker:ping):  Started
>>>         pgsql   (ocf::heartbeat:pgsql): Started
>>>  Node dev-cluster2-node4 (172793107): online
>>>         pgsql   (ocf::heartbeat:pgsql): Master
>>>         pingCheck       (ocf::pacemaker:ping):  Started
>>>         VirtualIP       (ocf::heartbeat:IPaddr2):       Started
>>> 
>>>  Node Attributes:
>>>  * Node dev-cluster2-node2:
>>>     + default_ping_set                  : 100
>>>     + master-pgsql                      : -INFINITY
>>>     + pgsql-data-status                 : STREAMING|ASYNC
>>>     + pgsql-status                      : HS:async
>>>  * Node dev-cluster2-node3:
>>>     + default_ping_set                  : 100
>>>     + master-pgsql                      : -INFINITY
>>>     + pgsql-data-status                 : STREAMING|ASYNC
>>>     + pgsql-status                      : HS:async
>>>  * Node dev-cluster2-node4:
>>>     + default_ping_set                  : 100
>>>     + master-pgsql                      : 1000
>>>     + pgsql-data-status                 : LATEST
>>>     + pgsql-master-baseline             : 0000000002000078
>>>     + pgsql-status                      : PRI
>>> 
>>>  Migration summary:
>>>  * Node dev-cluster2-node4:
>>>  * Node dev-cluster2-node2:
>>>  * Node dev-cluster2-node3:
>>> 
>>>  Tickets:
>>> 
>>>  CONFIG:
>>>  node $id="172793105" dev-cluster2-node2. \
>>>         attributes pgsql-data-status="STREAMING|ASYNC" standby="false"
>>>  node $id="172793106" dev-cluster2-node3. \
>>>         attributes pgsql-data-status="STREAMING|ASYNC" standby="false"
>>>  node $id="172793107" dev-cluster2-node4. \
>>>         attributes pgsql-data-status="LATEST"
>>>  primitive VirtualIP ocf:heartbeat:IPaddr2 \
>>>         params ip="10.76.157.194" \
>>>         op start interval="0" timeout="60s" on-fail="stop" \
>>>         op monitor interval="10s" timeout="60s" on-fail="restart" \
>>>         op stop interval="0" timeout="60s" on-fail="block"
>>>  primitive pgsql ocf:heartbeat:pgsql \
>>>         params pgctl="/usr/pgsql-9.1/bin/pg_ctl" 
>>> psql="/usr/pgsql-9.1/bin/psql" pgdata="/var/lib/pgsql/9.1/data" 
>>> tmpdir="/tmp/pg" start_opt="-p 5432" 
>>> logfile="/var/lib/pgsql/9.1//pgstartup.log" rep_mode="async" node_list=" 
>>> dev-cluster2-node2. dev-cluster2-node3. dev-cluster2-node4. " 
>>> restore_command="gzip -cd 
>>> /var/backup/pitr/dev-cluster2-master#5432/xlog/%f.gz > %p" 
>>> primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 
>>> keepalives_count=5" master_ip="10.76.157.194" \
>>>         op start interval="0" timeout="60s" on-fail="restart" \
>>>         op monitor interval="5s" timeout="61s" on-fail="restart" \
>>>         op monitor interval="1s" role="Master" timeout="62s" 
>>> on-fail="restart" \
>>>         op promote interval="0" timeout="63s" on-fail="restart" \
>>>         op demote interval="0" timeout="64s" on-fail="stop" \
>>>         op stop interval="0" timeout="65s" on-fail="block" \
>>>         op notify interval="0" timeout="66s"
>>>  primitive pingCheck ocf:pacemaker:ping \
>>>         params name="default_ping_set" host_list="10.76.156.1" 
>>> multiplier="100" \
>>>         op start interval="0" timeout="60s" on-fail="restart" \
>>>         op monitor interval="10s" timeout="60s" on-fail="restart" \
>>>         op stop interval="0" timeout="60s" on-fail="ignore"
>>>  ms msPostgresql pgsql \
>>>         meta master-max="1" master-node-max="1" clone-node-max="1" 
>>> notify="true" target-role="Master" clone-max="3"
>>>  clone clnPingCheck pingCheck \
>>>         meta clone-max="3"
>>>  location l0_DontRunPgIfNotPingGW msPostgresql \
>>>         rule $id="l0_DontRunPgIfNotPingGW-rule" -inf: not_defined 
>>> default_ping_set or default_ping_set lt 100
>>>  colocation r0_StartPgIfPingGW inf: msPostgresql clnPingCheck
>>>  colocation r1_MastersGroup inf: VirtualIP msPostgresql:Master
>>>  order rsc_order-1 0: clnPingCheck msPostgresql
>>>  order rsc_order-2 0: msPostgresql:promote VirtualIP:start symmetrical=false
>>>  order rsc_order-3 0: msPostgresql:demote VirtualIP:stop symmetrical=false
>>>  property $id="cib-bootstrap-options" \
>>>         dc-version="1.1.10-1.el6-368c726" \
>>>         cluster-infrastructure="corosync" \
>>>         stonith-enabled="false" \
>>>         no-quorum-policy="stop"
>>>  rsc_defaults $id="rsc-options" \
>>>         resource-stickiness="INFINITY" \
>>>         migration-threshold="1"
>>> 
>>>  Tell me where to look: why is Pacemaker not working?
>> 
>> You might want to follow some of the steps at:
>> 
>>    http://blog.clusterlabs.org/blog/2013/debugging-pacemaker/
>> 
>> under the heading "Resource-level failures".
> 
> Yes, thank you. 
> I've seen this article and am now studying it in more detail.
> There is a lot of information in the logs, so it is hard to tell which 
> messages are the error itself and which are consequences of it.
> Now I'm trying to figure it out.
> 
> BUT...
> For now I can say with certainty that the monitor action of the MS (pgsql) 
> resource agent is called ONLY on the node where Pacemaker was started last.

It looks like you're hitting 
https://github.com/beekhof/pacemaker/commit/58962338
Since you appear to be on rhel6 (or a clone of rhel6), can I suggest you use 
the 1.1.10 packages that come with 6.4?
They include the above patch.

Also, just to be sure: are you expecting monitor operations to detect when you 
start a resource manually?
If so, you'll need a monitor operation with role="Stopped". We don't do that by 
default.
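For illustration, such a monitor could be added to the existing pgsql primitive in crm shell syntax. This is a sketch only; the 4s interval is an assumed value whose only requirement is that it differ from the intervals of the other monitor operations already defined on the resource:

```
# Hypothetical additional op line for the pgsql primitive, so the cluster
# also probes nodes where it believes the resource is stopped.
# The interval (4s) is illustrative, not a recommendation.
op monitor interval="4s" role="Stopped" timeout="60s"
```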


>> 
>> 'crm_mon -o' might be a good source of information too.
> As a result, I see my resources allegedly functioning normally.
> 
> # crm_mon -o1 
> Last updated: Tue Nov 12 09:27:16 2013
> Last change: Tue Nov 12 00:08:35 2013 via crm_attribute on dev-cluster2-node2
> Stack: corosync
> Current DC: dev-cluster2-node2 (172793105) - partition with quorum
> Version: 1.1.10-1.el6-368c726
> 3 Nodes configured
> 337 Resources configured
> 
> 
> Online: [ dev-cluster2-node2 dev-cluster2-node3 dev-cluster2-node4 ]
> 
> Clone Set: clonePing [pingCheck]
>     Started: [ dev-cluster2-node2 dev-cluster2-node3 dev-cluster2-node4 ]
> Master/Slave Set: msPgsql [pgsql]
>     Masters: [ dev-cluster2-node2 ]
>     Slaves: [ dev-cluster2-node3 dev-cluster2-node4 ]
> VirtualIP      (ocf::heartbeat:IPaddr2):       Started dev-cluster2-node2
> 
> Operations:
> * Node dev-cluster2-node2:
>   pingCheck: migration-threshold=1
>    + (20) start: rc=0 (ok)
>    + (23) monitor: interval=10000ms rc=0 (ok)
>   pgsql: migration-threshold=1
>    + (41) promote: rc=0 (ok)
>    + (87) monitor: interval=1000ms rc=8 (master)
>   VirtualIP: migration-threshold=1
>    + (49) start: rc=0 (ok)
>    + (52) monitor: interval=10000ms rc=0 (ok)
> * Node dev-cluster2-node3:
>   pingCheck: migration-threshold=1
>    + (20) start: rc=0 (ok)
>    + (23) monitor: interval=10000ms rc=0 (ok)
>   pgsql: migration-threshold=1
>    + (26) start: rc=0 (ok)
>    + (32) monitor: interval=10000ms rc=0 (ok)
> * Node dev-cluster2-node4:
>   pingCheck: migration-threshold=1
>    + (20) start: rc=0 (ok)
>    + (23) monitor: interval=10000ms rc=0 (ok)
>   pgsql: migration-threshold=1
>    + (26) start: rc=0 (ok)
>    + (32) monitor: interval=10000ms rc=0 (ok)
> 
> In reality, I have now killed (with signal 4 or 6) the PG master and the 
> next-to-last PG slave.
> IMHO, even if I have configured something incorrectly, the inability to 
> monitor a resource should cause a fatal error.
> Or is there a reason it does not?
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
