On Tue, Oct 5, 2010 at 6:44 AM, <renayama19661...@ybb.ne.jp> wrote:
> Hi,
>
> We tested a complicated node failure scenario.
>
> An "Election Timeout" error occurred during the test.
>
> * Pacemaker:pacemaker-1.0.9.1
> * heartbeat-3.0.3-2.3.el5
> * cluster-glue:cluster-glue-1.0.6-1.6.el5
> * resource-agents-1.0.3-1.0.dev.b7a3b1973ba7
>
> We tested it with the following procedure.
>
> Step1) Start all nodes.
> Step2) On node cgl49, we cause a monitor error for prmApPostgreSQLDB1.
> Step3) Node cgl49 is fenced (STONITH) by node cgl54.
> Step4) During Step3, we kill the master process on node cgl54.
> Step5) Node cgl54 reboots.
> Step6) STONITH of node cgl49 is attempted.
> Step7) Node cgl53 is promoted to DC.
> Step8) STONITH of node cgl49 is attempted again.
> However, because node cgl49 can only be fenced from node cgl54,
> STONITH times out and loops.
>
> ============
> Last updated: Mon Aug 30 14:40:58 2010
> Stack: Heartbeat
> Current DC: cgl53 (a07bcfc0-7aee-4382-9a2b-711b9c93e7e9) - partition WITHOUT quorum
> Version: 1.0.9-74392a28b7f3 stable-1.0 tip
> 4 Nodes configured, unknown expected votes
> 16 Resources configured.
> ============
>
> Node cgl49 (979c05ea-442b-4f53-9ba7-6cb7e82f30ac): UNCLEAN (offline)
> Node cgl54 (9bea1025-3cbe-481f-830d-a24dfc7f0374): UNCLEAN (offline)
> Online: [ cgl50 cgl53 ]
>
> Step9) When node cgl54 comes back, a DC election is performed, but
> an error occurs here.
>
> * cgl50 node
> crmd: [32110]: info: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=do_election_count_vote ]
> crmd: [32110]: info: update_dc: Unset DC cgl53
> (snip)
> cgl50 crmd: [32110]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped!
>
> * cgl53 node
> crmd: [1325]: info: do_state_transition: State transition S_INTEGRATION -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=do_election_count_vote ]
> cgl53 crmd: [1325]: info: update_dc: Unset DC cgl53
> (snip)
> crmd: [1325]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped!
> (snip)
> crmd: [1325]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped!
> (snip)
> crmd: [1325]: info: crmd_ha_msg_filter: Another DC detected: cgl50 (op=join_offer)
>
>
> Step10) Node cgl53 enters the "Pending" state.
> Node cgl53 then becomes "online" after the waiting STONITH operation
> times out.
>
> Why did the "Election Timeout" occur?

Possibly the ccm membership hasn't fully recovered.

> Why did node cgl53 enter the "Pending" state?

This is usually when we know the node is up, but we couldn't complete
the CRM-level negotiation necessary for it to run resources.
Possibly it's in a bad state waiting for something to start, or its
replies are being lost.

> Possibly this may be a problem of ccm.
> In addition, the same problem may be already reported.
>
>
> * Because the log file was big, I registered the same contents with Bugzilla.
> * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2502

OK, I'll follow up there.

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
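
For anyone going through a large log like the one attached to the bug, a rough way to see which crmd instance hit the election timer and which state it was in beforehand is to pull out just the `do_state_transition` and `crm_timer_popped` lines. The sketch below is only an illustration and not part of Pacemaker: the `summarize` helper and the two regular expressions are hypothetical, the sample lines are the excerpts quoted in this thread unwrapped onto single lines, and results are keyed by the crmd PID because not every excerpt carries a hostname. A real ha-log also has timestamps and other prefixes, so the patterns may need adjusting.

```python
import re
from collections import Counter

# Sample lines taken from the excerpts quoted in this thread, unwrapped
# onto single lines. A real log will have timestamps and more context.
SAMPLE_LOG = """\
crmd: [32110]: info: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=do_election_count_vote ]
cgl50 crmd: [32110]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped!
crmd: [1325]: info: do_state_transition: State transition S_INTEGRATION -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=do_election_count_vote ]
crmd: [1325]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped!
crmd: [1325]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped!
"""

# Hypothetical patterns, written against the excerpts above only.
TRANSITION = re.compile(
    r"crmd: \[(\d+)\]: \w+: do_state_transition: State transition (\S+) -> (\S+)"
)
TIMEOUT = re.compile(
    r"crmd: \[(\d+)\]: ERROR: crm_timer_popped: Election Timeout"
)


def summarize(log_text):
    """Return (transitions, timeouts) found in log_text.

    transitions: list of (pid, from_state, to_state) in log order.
    timeouts: Counter mapping crmd pid -> number of Election Timeout errors.
    """
    transitions = []
    timeouts = Counter()
    for line in log_text.splitlines():
        match = TRANSITION.search(line)
        if match:
            transitions.append(match.groups())
            continue
        match = TIMEOUT.search(line)
        if match:
            timeouts[match.group(1)] += 1
    return transitions, timeouts


if __name__ == "__main__":
    transitions, timeouts = summarize(SAMPLE_LOG)
    for pid, old, new in transitions:
        print(f"crmd[{pid}]: {old} -> {new}")
    for pid, count in timeouts.items():
        print(f"crmd[{pid}]: {count} election timeout(s)")
```

This only narrows down where to look; it says nothing about why the election did not complete, which still needs the ccm membership messages around the same time.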