Hi, 2014-02-24 10:49 GMT+09:00 Andrew Beekhof <and...@beekhof.net>: > > On 21 Feb 2014, at 2:19 pm, Andrew Beekhof <and...@beekhof.net> wrote: > >> >> On 18 Feb 2014, at 1:23 pm, Andrew Beekhof <and...@beekhof.net> wrote: >> >>> >>> On 6 Feb 2014, at 7:45 pm, Keisuke MORI <keisuke.mori...@gmail.com> wrote: >>> >>>> Hi, >>>> >>>> I observed that pre_notify_demote is issued twice when a master >>>> resource is migrating. >>>> I'm wondering if this is the correct behavior. >>>> >>>> Steps to reproduce: >>>> >>>> - Start up 2 nodes cluster configured for the PostgreSQL streaming >>>> replication using pgsql RA as a master/slave resource. >>>> - kill the postgresql process on the master node to induce a fail-over. >>>> - The fail-over succeeds as expected, but pre_notify_demote was >>>> executed twice on each node before demoting on the master resource. >>>> >>>> 100% reproducible on my cluster. >>>> >>>> Pacemaker version: 1.1.11-rc4 (source build from the repo) >>>> OS: RHEL6.4 >>>> >>>> I have never seen this on Pacemaker-1.0.* cluster with the same >>>> configuration. >>>> >>>> The relevant logs and pe-inputs are attached. >>>> >>>> >>>> Diagnostics: >>>> >>>> (1) The first transition caused by the process failure (pe-input-160) >>>> initiates pre_notify_demote on both nodes and cancelling slave monitor >>>> on the slave node. >>>> {{{ >>>> 171 Jan 30 16:08:59 rhel64-1 crmd[8143]: notice: te_rsc_command: >>>> Initiating action 9: cancel prmPostgresql_cancel_10000 on rhel64-2 >>>> 172 Jan 30 16:08:59 rhel64-1 crmd[8143]: notice: te_rsc_command: >>>> Initiating action 79: notify prmPostgresql_pre_notify_demote_0 on >>>> rhel64-1 (local) >>>> >>>> 175 Jan 30 16:08:59 rhel64-1 crmd[8143]: notice: te_rsc_command: >>>> Initiating action 81: notify prmPostgresql_pre_notify_demote_0 on >>>> rhel64-2 >>>> }}} >>>> >>>> (2) When cancelling slave monitor completes, the transition is aborted >>>> by Resource op removal. >>>> {{{ >>>> 176 Jan 30 16:08:59 rhel64-1 crmd[8143]: info: match_graph_event: >>>> Action prmPostgresql_monitor_10000 (9) confirmed on rhel64-2 (rc=0) >>>> 177 Jan 30 16:08:59 rhel64-1 cib[8138]: info: cib_process_request: >>>> Completed cib_delete operation for section status: OK (rc=0, >>>> origin=rhel64-2/crmd/21, version=0.37.9) >>>> 178 Jan 30 16:08:59 rhel64-1 crmd[8143]: info: >>>> abort_transition_graph: te_update_diff:258 - Triggered transition >>>> abort (complete=0, node=rhel64-2, tag=lrm_rsc_op, >>>> id=prmPostgresql_monitor_10000, >>>> magic=0:0;26:12:0:acf9a2a3-307c-460b-b786-fc20e6b8aad5, cib=0.37.9) : >>>> Resource op removal >>>> }}} >>>> >>>> (3) The second transition is calculated by the abort (pe-input-161) >>>> which results initiating pre_notify_demote again. >>> >>> If the demote didn't complete (or wasn't even attempted), then we must send >>> the pre_notify_demote again unfortunately. >>> The real bug may well be that the transition shouldn't have been aborted. >> >> It looks legitimate: >> >> Jan 30 16:08:59 rhel64-1 crmd[8143]: info: abort_transition_graph: >> te_update_diff:258 - Triggered transition abort (complete=0, node=rhel64-2, >> tag=lrm_rsc_op, id=prmPostgresql_monitor_10000, >> magic=0:0;26:12:0:acf9a2a3-307c-460b-b786-fc20e6b8aad5, cib=0.37.9) : >> Resource op removal > > It looks like get_cancel_action() was not functioning correctly: > > https://github.com/beekhof/pacemaker/commit/9d77c99
Thanks for looking into it. I have confirmed that the issue is now resolved with the recent revision on your repo. at: https://github.com/beekhof/pacemaker/commit/04ff1bd2d144e7defd6f1f67f6bde6fa95c428e1 Thanks! -- Keisuke MORI > >> >> Jan 30 16:08:59 rhel64-1 cib[8138]: info: cib_process_request: Completed >> cib_delete operation for section status: OK (rc=0, origin=rhel64-2/crmd/21, >> version=0.37.9) >> >> It looks like part of the node status entry being removed for rhel64-2. >> Possibly as a result of: >> >> Jan 30 16:07:54 rhel64-2 crmd[25070]: info: erase_status_tag: Deleting >> xpath: //node_state[@uname='rhel64-2']/transient_attributes >> >> The new cib code, being much faster, might help here too :) >> >>> >>>> {{{ >>>> 227 Jan 30 16:09:01 rhel64-1 pengine[8142]: notice: >>>> process_pe_message: Calculated Transition 15: >>>> /var/lib/pacemaker/pengine/pe-input-161.bz2 >>>> 229 Jan 30 16:09:01 rhel64-1 crmd[8143]: notice: te_rsc_command: >>>> Initiating action 78: notify prmPostgresql_pre_notify_demote_0 on >>>> rhel64-1 (local) >>>> 232 Jan 30 16:09:01 rhel64-1 crmd[8143]: notice: te_rsc_command: >>>> Initiating action 80: notify prmPostgresql_pre_notify_demote_0 on >>>> rhel64-2 >>>> }}} >>>> >>>> I think that the transition abort at (2) should not happen. >>>> >>>> Regards, >>>> -- >>>> Keisuke MORI >>>> <logs-pre-notify-20140206.tar.bz2>_______________________________________________ >>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org >>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker >>>> >>>> Project Home: http://www.clusterlabs.org >>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf >>>> Bugs: http://bugs.clusterlabs.org > > > _______________________________________________ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org > _______________________________________________ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org