11.11.2013, 03:44, "Andrew Beekhof" <and...@beekhof.net>:
> On 8 Nov 2013, at 7:49 am, Andrey Groshev <gre...@yandex.ru> wrote:
>
>> Hi, PPL!
>> I need help. I do not understand why this has stopped working.
>> This configuration works on another cluster, but that one is on corosync1.
>>
>> So... a postgres cluster with master/slave.
>> Classic config, as in the wiki.
>> I build the cluster and start it, and it works.
>> Next I kill postgres on the Master with signal 6, as if "no disk space left":
>>
>> # pkill -6 postgres
>> # ps axuww|grep postgres
>> root 9032 0.0 0.1 103236 860 pts/0 S+ 00:37 0:00 grep postgres
>>
>> PostgreSQL dies, but crm_mon shows that the master is still running.
>>
>> Last updated: Fri Nov 8 00:42:08 2013
>> Last change: Fri Nov 8 00:37:05 2013 via crm_attribute on dev-cluster2-node4
>> Stack: corosync
>> Current DC: dev-cluster2-node4 (172793107) - partition with quorum
>> Version: 1.1.10-1.el6-368c726
>> 3 Nodes configured
>> 7 Resources configured
>>
>> Node dev-cluster2-node2 (172793105): online
>>     pingCheck (ocf::pacemaker:ping): Started
>>     pgsql (ocf::heartbeat:pgsql): Started
>> Node dev-cluster2-node3 (172793106): online
>>     pingCheck (ocf::pacemaker:ping): Started
>>     pgsql (ocf::heartbeat:pgsql): Started
>> Node dev-cluster2-node4 (172793107): online
>>     pgsql (ocf::heartbeat:pgsql): Master
>>     pingCheck (ocf::pacemaker:ping): Started
>>     VirtualIP (ocf::heartbeat:IPaddr2): Started
>>
>> Node Attributes:
>> * Node dev-cluster2-node2:
>>     + default_ping_set : 100
>>     + master-pgsql : -INFINITY
>>     + pgsql-data-status : STREAMING|ASYNC
>>     + pgsql-status : HS:async
>> * Node dev-cluster2-node3:
>>     + default_ping_set : 100
>>     + master-pgsql : -INFINITY
>>     + pgsql-data-status : STREAMING|ASYNC
>>     + pgsql-status : HS:async
>> * Node dev-cluster2-node4:
>>     + default_ping_set : 100
>>     + master-pgsql : 1000
>>     + pgsql-data-status : LATEST
>>     + pgsql-master-baseline : 0000000002000078
>>     + pgsql-status : PRI
>>
>> Migration summary:
>> * Node dev-cluster2-node4:
>> * Node dev-cluster2-node2:
>> * Node dev-cluster2-node3:
>>
>> Tickets:
>>
>> CONFIG:
>> node $id="172793105" dev-cluster2-node2. \
>>     attributes pgsql-data-status="STREAMING|ASYNC" standby="false"
>> node $id="172793106" dev-cluster2-node3. \
>>     attributes pgsql-data-status="STREAMING|ASYNC" standby="false"
>> node $id="172793107" dev-cluster2-node4. \
>>     attributes pgsql-data-status="LATEST"
>> primitive VirtualIP ocf:heartbeat:IPaddr2 \
>>     params ip="10.76.157.194" \
>>     op start interval="0" timeout="60s" on-fail="stop" \
>>     op monitor interval="10s" timeout="60s" on-fail="restart" \
>>     op stop interval="0" timeout="60s" on-fail="block"
>> primitive pgsql ocf:heartbeat:pgsql \
>>     params pgctl="/usr/pgsql-9.1/bin/pg_ctl" psql="/usr/pgsql-9.1/bin/psql" pgdata="/var/lib/pgsql/9.1/data" tmpdir="/tmp/pg" start_opt="-p 5432" logfile="/var/lib/pgsql/9.1//pgstartup.log" rep_mode="async" node_list=" dev-cluster2-node2. dev-cluster2-node3. dev-cluster2-node4. " restore_command="gzip -cd /var/backup/pitr/dev-cluster2-master#5432/xlog/%f.gz > %p" primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" master_ip="10.76.157.194" \
>>     op start interval="0" timeout="60s" on-fail="restart" \
>>     op monitor interval="5s" timeout="61s" on-fail="restart" \
>>     op monitor interval="1s" role="Master" timeout="62s" on-fail="restart" \
>>     op promote interval="0" timeout="63s" on-fail="restart" \
>>     op demote interval="0" timeout="64s" on-fail="stop" \
>>     op stop interval="0" timeout="65s" on-fail="block" \
>>     op notify interval="0" timeout="66s"
>> primitive pingCheck ocf:pacemaker:ping \
>>     params name="default_ping_set" host_list="10.76.156.1" multiplier="100" \
>>     op start interval="0" timeout="60s" on-fail="restart" \
>>     op monitor interval="10s" timeout="60s" on-fail="restart" \
>>     op stop interval="0" timeout="60s" on-fail="ignore"
>> ms msPostgresql pgsql \
>>     meta master-max="1" master-node-max="1" clone-node-max="1" notify="true" target-role="Master" clone-max="3"
>> clone clnPingCheck pingCheck \
>>     meta clone-max="3"
>> location l0_DontRunPgIfNotPingGW msPostgresql \
>>     rule $id="l0_DontRunPgIfNotPingGW-rule" -inf: not_defined default_ping_set or default_ping_set lt 100
>> colocation r0_StartPgIfPingGW inf: msPostgresql clnPingCheck
>> colocation r1_MastersGroup inf: VirtualIP msPostgresql:Master
>> order rsc_order-1 0: clnPingCheck msPostgresql
>> order rsc_order-2 0: msPostgresql:promote VirtualIP:start symmetrical=false
>> order rsc_order-3 0: msPostgresql:demote VirtualIP:stop symmetrical=false
>> property $id="cib-bootstrap-options" \
>>     dc-version="1.1.10-1.el6-368c726" \
>>     cluster-infrastructure="corosync" \
>>     stonith-enabled="false" \
>>     no-quorum-policy="stop"
>> rsc_defaults $id="rsc-options" \
>>     resource-stickiness="INFINITY" \
>>     migration-threshold="1"
>>
>> Tell me where to look - why does pacemaker not work?
>
> You might want to follow some of the steps at:
>
> http://blog.clusterlabs.org/blog/2013/debugging-pacemaker/
>
> under the heading "Resource-level failures".

Yes, thank you. I had seen this article, and I am now studying it in more detail.
There is a lot of information in the logs, so it is difficult to tell which messages are the error itself and which are its consequences.
I am still trying to figure it out.
BUT... so far I can say with certainty that the RA monitor for the MS resource (pgsql) is called ONLY on the node where Pacemaker was started last.

>
> 'crm_mon -o' might be a good source of information too.

So I see that my resources are allegedly functioning normally.

# crm_mon -o1
Last updated: Tue Nov 12 09:27:16 2013
Last change: Tue Nov 12 00:08:35 2013 via crm_attribute on dev-cluster2-node2
Stack: corosync
Current DC: dev-cluster2-node2 (172793105) - partition with quorum
Version: 1.1.10-1.el6-368c726
3 Nodes configured
337 Resources configured

Online: [ dev-cluster2-node2 dev-cluster2-node3 dev-cluster2-node4 ]

 Clone Set: clonePing [pingCheck]
     Started: [ dev-cluster2-node2 dev-cluster2-node3 dev-cluster2-node4 ]
 Master/Slave Set: msPgsql [pgsql]
     Masters: [ dev-cluster2-node2 ]
     Slaves: [ dev-cluster2-node3 dev-cluster2-node4 ]
 VirtualIP (ocf::heartbeat:IPaddr2): Started dev-cluster2-node2

Operations:
* Node dev-cluster2-node2:
   pingCheck: migration-threshold=1
    + (20) start: rc=0 (ok)
    + (23) monitor: interval=10000ms rc=0 (ok)
   pgsql: migration-threshold=1
    + (41) promote: rc=0 (ok)
    + (87) monitor: interval=1000ms rc=8 (master)
   VirtualIP: migration-threshold=1
    + (49) start: rc=0 (ok)
    + (52) monitor: interval=10000ms rc=0 (ok)
* Node dev-cluster2-node3:
   pingCheck: migration-threshold=1
    + (20) start: rc=0 (ok)
    + (23) monitor: interval=10000ms rc=0 (ok)
   pgsql: migration-threshold=1
    + (26) start: rc=0 (ok)
    + (32) monitor: interval=10000ms rc=0 (ok)
* Node dev-cluster2-node4:
   pingCheck: migration-threshold=1
    + (20) start: rc=0 (ok)
    + (23) monitor: interval=10000ms rc=0 (ok)
   pgsql: migration-threshold=1
    + (26) start: rc=0 (ok)
    + (32) monitor: interval=10000ms rc=0 (ok)

In reality, the PG master and the penultimate PG slave have now been killed (signal 4|6).
IMHO, even if I have something configured incorrectly, the inability to monitor a resource should cause a fatal error.
Or is there a reason not to do so?

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
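
For anyone comparing 'crm_mon -o' output across nodes, here is a rough sketch of how the "Operations:" section could be pulled apart, so that the recorded monitor intervals and return codes can be listed per node and resource. It is not part of Pacemaker: the function names (parse_operations, recurring_monitors) are made up for illustration, and the parser assumes the line layout shown in the output above, which may differ in other Pacemaker versions. The rc values follow the OCF convention (0 = ok, 7 = not running, 8 = running as master). As far as I understand, they are the last results recorded in the cluster status rather than a fresh check of the resource.

```python
import re

# OCF return codes relevant here (from the OCF resource agent convention).
OCF_RC = {0: "ok", 7: "not running", 8: "master", 9: "failed master"}

# Line shapes assumed from the 'crm_mon -o' output quoted in this thread.
NODE_RE = re.compile(r"^\*\s+Node\s+(\S+):\s*$")
RSC_RE = re.compile(r"^(\S+):\s+migration-threshold=\d+")
OP_RE = re.compile(
    r"^\+\s+\((\d+)\)\s+(\w+):(?:\s+interval=(\d+)ms)?\s+rc=(\d+)\s+\(([^)]*)\)"
)


def parse_operations(text):
    """Return one dict per operation line found in 'crm_mon -o' output."""
    ops = []
    node = rsc = None
    for raw in text.splitlines():
        line = raw.strip()
        m = NODE_RE.match(line)
        if m:
            node, rsc = m.group(1), None
            continue
        m = RSC_RE.match(line)
        if m:
            rsc = m.group(1)
            continue
        m = OP_RE.match(line)
        if m and node and rsc:
            call_id, op, interval, rc, label = m.groups()
            ops.append({
                "node": node,
                "resource": rsc,
                "call_id": int(call_id),
                "op": op,
                # One-shot operations (start, promote, ...) carry no interval.
                "interval_ms": int(interval) if interval else None,
                "rc": int(rc),
                "label": label,
            })
    return ops


def recurring_monitors(ops):
    """Map (node, resource) -> interval in ms for each recurring monitor."""
    return {
        (o["node"], o["resource"]): o["interval_ms"]
        for o in ops
        if o["op"] == "monitor" and o["interval_ms"] is not None
    }


# Excerpt of the output above, used only as sample input.
SAMPLE = """\
Operations:
* Node dev-cluster2-node2:
   pgsql: migration-threshold=1
    + (41) promote: rc=0 (ok)
    + (87) monitor: interval=1000ms rc=8 (master)
* Node dev-cluster2-node3:
   pgsql: migration-threshold=1
    + (26) start: rc=0 (ok)
    + (32) monitor: interval=10000ms rc=0 (ok)
"""

if __name__ == "__main__":
    for (node, rsc), ms in sorted(recurring_monitors(parse_operations(SAMPLE)).items()):
        print(f"{node} {rsc}: monitor every {ms} ms")
```

A (node, resource) pair that is missing from the result of recurring_monitors has no recurring monitor recorded for it, which would be one way to spot a node where the monitor was never scheduled.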