On 12 Nov 2013, at 4:42 pm, Andrey Groshev <gre...@yandex.ru> wrote:
> > > 11.11.2013, 03:44, "Andrew Beekhof" <and...@beekhof.net>: >> On 8 Nov 2013, at 7:49 am, Andrey Groshev <gre...@yandex.ru> wrote: >> >>> Hi, PPL! >>> I need help. I do not understand... Why has stopped working. >>> This configuration work on other cluster, but on corosync1. >>> >>> So... cluster postgres with master/slave. >>> Classic config as in wiki. >>> I build cluster, start, he is working. >>> Next I kill postgres on Master with 6 signal, as if "disk space left" >>> >>> # pkill -6 postgres >>> # ps axuww|grep postgres >>> root 9032 0.0 0.1 103236 860 pts/0 S+ 00:37 0:00 grep >>> postgres >>> >>> PostgreSQL die, But crm_mon shows that the master is still running. >>> >>> Last updated: Fri Nov 8 00:42:08 2013 >>> Last change: Fri Nov 8 00:37:05 2013 via crm_attribute on >>> dev-cluster2-node4 >>> Stack: corosync >>> Current DC: dev-cluster2-node4 (172793107) - partition with quorum >>> Version: 1.1.10-1.el6-368c726 >>> 3 Nodes configured >>> 7 Resources configured >>> >>> Node dev-cluster2-node2 (172793105): online >>> pingCheck (ocf::pacemaker:ping): Started >>> pgsql (ocf::heartbeat:pgsql): Started >>> Node dev-cluster2-node3 (172793106): online >>> pingCheck (ocf::pacemaker:ping): Started >>> pgsql (ocf::heartbeat:pgsql): Started >>> Node dev-cluster2-node4 (172793107): online >>> pgsql (ocf::heartbeat:pgsql): Master >>> pingCheck (ocf::pacemaker:ping): Started >>> VirtualIP (ocf::heartbeat:IPaddr2): Started >>> >>> Node Attributes: >>> * Node dev-cluster2-node2: >>> + default_ping_set : 100 >>> + master-pgsql : -INFINITY >>> + pgsql-data-status : STREAMING|ASYNC >>> + pgsql-status : HS:async >>> * Node dev-cluster2-node3: >>> + default_ping_set : 100 >>> + master-pgsql : -INFINITY >>> + pgsql-data-status : STREAMING|ASYNC >>> + pgsql-status : HS:async >>> * Node dev-cluster2-node4: >>> + default_ping_set : 100 >>> + master-pgsql : 1000 >>> + pgsql-data-status : LATEST >>> + pgsql-master-baseline : 0000000002000078 >>> + pgsql-status : PRI >>> >>> Migration summary: >>> * Node dev-cluster2-node4: >>> * Node dev-cluster2-node2: >>> * Node dev-cluster2-node3: >>> >>> Tickets: >>> >>> CONFIG: >>> node $id="172793105" dev-cluster2-node2. \ >>> attributes pgsql-data-status="STREAMING|ASYNC" standby="false" >>> node $id="172793106" dev-cluster2-node3. \ >>> attributes pgsql-data-status="STREAMING|ASYNC" standby="false" >>> node $id="172793107" dev-cluster2-node4. \ >>> attributes pgsql-data-status="LATEST" >>> primitive VirtualIP ocf:heartbeat:IPaddr2 \ >>> params ip="10.76.157.194" \ >>> op start interval="0" timeout="60s" on-fail="stop" \ >>> op monitor interval="10s" timeout="60s" on-fail="restart" \ >>> op stop interval="0" timeout="60s" on-fail="block" >>> primitive pgsql ocf:heartbeat:pgsql \ >>> params pgctl="/usr/pgsql-9.1/bin/pg_ctl" >>> psql="/usr/pgsql-9.1/bin/psql" pgdata="/var/lib/pgsql/9.1/data" >>> tmpdir="/tmp/pg" start_opt="-p 5432" >>> logfile="/var/lib/pgsql/9.1//pgstartup.log" rep_mode="async" node_list=" >>> dev-cluster2-node2. dev-cluster2-node3. dev-cluster2-node4. " >>> restore_command="gzip -cd >>> /var/backup/pitr/dev-cluster2-master#5432/xlog/%f.gz > %p" >>> primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 >>> keepalives_count=5" master_ip="10.76.157.194" \ >>> op start interval="0" timeout="60s" on-fail="restart" \ >>> op monitor interval="5s" timeout="61s" on-fail="restart" \ >>> op monitor interval="1s" role="Master" timeout="62s" >>> on-fail="restart" \ >>> op promote interval="0" timeout="63s" on-fail="restart" \ >>> op demote interval="0" timeout="64s" on-fail="stop" \ >>> op stop interval="0" timeout="65s" on-fail="block" \ >>> op notify interval="0" timeout="66s" >>> primitive pingCheck ocf:pacemaker:ping \ >>> params name="default_ping_set" host_list="10.76.156.1" >>> multiplier="100" \ >>> op start interval="0" timeout="60s" on-fail="restart" \ >>> op monitor interval="10s" timeout="60s" on-fail="restart" \ >>> op stop interval="0" timeout="60s" on-fail="ignore" >>> ms msPostgresql pgsql \ >>> meta master-max="1" master-node-max="1" clone-node-max="1" >>> notify="true" target-role="Master" clone-max="3" >>> clone clnPingCheck pingCheck \ >>> meta clone-max="3" >>> location l0_DontRunPgIfNotPingGW msPostgresql \ >>> rule $id="l0_DontRunPgIfNotPingGW-rule" -inf: not_defined >>> default_ping_set or default_ping_set lt 100 >>> colocation r0_StartPgIfPingGW inf: msPostgresql clnPingCheck >>> colocation r1_MastersGroup inf: VirtualIP msPostgresql:Master >>> order rsc_order-1 0: clnPingCheck msPostgresql >>> order rsc_order-2 0: msPostgresql:promote VirtualIP:start symmetrical=false >>> order rsc_order-3 0: msPostgresql:demote VirtualIP:stop symmetrical=false >>> property $id="cib-bootstrap-options" \ >>> dc-version="1.1.10-1.el6-368c726" \ >>> cluster-infrastructure="corosync" \ >>> stonith-enabled="false" \ >>> no-quorum-policy="stop" >>> rsc_defaults $id="rsc-options" \ >>> resource-stickiness="INFINITY" \ >>> migration-threshold="1" >>> >>> Tell me where to look - why does pacemaker not work? >> >> You might want to follow some of the steps at: >> >> http://blog.clusterlabs.org/blog/2013/debugging-pacemaker/ >> >> under the heading "Resource-level failures". > > Yes. Thank you. > I've seen this article and now I study it in more detail. > A lot of information in the logs, so it is difficult to determine where the > error is, and where the consequence of error. > Now I'm trying to figure it out. > > BUT... > While I can say with certainty that the RA with monitor in the MS(pgsql) is > called ONLY on the node on which the last was launched PACEMAKER. It looks like you're hitting https://github.com/beekhof/pacemaker/commit/58962338 Since you appear to be on rhel6 (or a clone of rhel6), can I suggest you use the 1.1.10 packages that come with 6.4? They include the above patch. Also, just to be sure. Are you expecting monitor operations to detect when you started a resource manually? If so, you'll need a monitor operation with role=Stopped. We don't do that by default. >> >> 'crm_mon -o' might be a good source of information too. > Therefore, I see that my resources allegedly functioning normally. > > # crm_mon -o1 > Last updated: Tue Nov 12 09:27:16 2013 > Last change: Tue Nov 12 00:08:35 2013 via crm_attribute on dev-cluster2-node2 > Stack: corosync > Current DC: dev-cluster2-node2 (172793105) - partition with quorum > Version: 1.1.10-1.el6-368c726 > 3 Nodes configured > 337 Resources configured > > > Online: [ dev-cluster2-node2 dev-cluster2-node3 dev-cluster2-node4 ] > > Clone Set: clonePing [pingCheck] > Started: [ dev-cluster2-node2 dev-cluster2-node3 dev-cluster2-node4 ] > Master/Slave Set: msPgsql [pgsql] > Masters: [ dev-cluster2-node2 ] > Slaves: [ dev-cluster2-node3 dev-cluster2-node4 ] > VirtualIP (ocf::heartbeat:IPaddr2): Started dev-cluster2-node2 > > Operations: > * Node dev-cluster2-node2: > pingCheck: migration-threshold=1 > + (20) start: rc=0 (ok) > + (23) monitor: interval=10000ms rc=0 (ok) > pgsql: migration-threshold=1 > + (41) promote: rc=0 (ok) > + (87) monitor: interval=1000ms rc=8 (master) > VirtualIP: migration-threshold=1 > + (49) start: rc=0 (ok) > + (52) monitor: interval=10000ms rc=0 (ok) > * Node dev-cluster2-node3: > pingCheck: migration-threshold=1 > + (20) start: rc=0 (ok) > + (23) monitor: interval=10000ms rc=0 (ok) > pgsql: migration-threshold=1 > + (26) start: rc=0 (ok) > + (32) monitor: interval=10000ms rc=0 (ok) > * Node dev-cluster2-node4: > pingCheck: migration-threshold=1 > + (20) start: rc=0 (ok) > + (23) monitor: interval=10000ms rc=0 (ok) > pgsql: migration-threshold=1 > + (26) start: rc=0 (ok) > + (32) monitor: interval=10000ms rc=0 (ok) > > In reality now killed (signal 4|6) the PG master and the penultimate slave PG. > IMHO, even if I have something configured incorrectly, the inability to > monitor the resource must cause a fatal error. > Or is there a reason not to do so? > > _______________________________________________ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org
signature.asc
Description: Message signed with OpenPGP using GPGMail
_______________________________________________ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org