Hi Andrew,

> > It is time when STONITH is carried out in the environment of two nodes by a
> > standby node.
> >
> > A resource is started without waiting for completion of STONITH from a DC
> > node.
> > While STONITH is not completed, this problem happens if an active node fell.
>
> So let me see if I understand this correctly...
>
> You start with two healthy nodes.

Yes.

> You cause a resource on A to fail, at which point B tries to shoot it.

Yes.

> The stonith op never completes and before it times out, you restart B.

No. It is node A that is rebooted.
- Node A is the node that node B is going to shoot.

> Resources get started on B.

Yes.
The dummy resource is started when node B is the DC.
When node B is not the DC, it is not started.

> Questions:
>
> Is the above accurate?
> Is only the dummy resource started, or are other ones started too?

Yes.

> When B comes up again, does it form a two-node cluster with A?
> Is A still up or has it become the DC and shot itself?

I have not confirmed the state after node A rebooted.

> Sorry, parsing error... I can't tell if you're saying the problem also
> exists for clusters based on OpenAIS.
> I think you're saying it does not happen if you use OpenAIS instead of
> Heartbeat.

Yes.
The same problem does not occur with OpenAIS.
With OpenAIS, the transition that would start the dummy resource appears to be aborted after the partner node disappears.
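In the log excerpt that follows, crmd, cib, and openais lines are interleaved, so a small filter can make the relevant crmd events easier to pick out. This is only a minimal sketch, assuming the syslog layout visible in these logs (`Mon D HH:MM:SS host daemon: [pid]: level: function: message`). The function name `crmd_events`, the regular expression, and the default list of functions are illustrative choices, not part of Pacemaker.

```python
import re

# Assumed layout, taken from the excerpt in this message:
#   "Jan 9 13:34:30 ais-1 crmd: [16497]: info: func_name: message"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+"
    r"(?P<daemon>[\w-]+):\s+\[(?P<pid>\d+)\]:\s+(?P<level>\w+):\s+"
    r"(?P<func>\w+):\s+(?P<msg>.*)$"
)

# Hypothetical default: the functions that appear in this thread's logs.
DEFAULT_FUNCS = (
    "crm_calculate_quorum",
    "abort_transition_graph",
    "send_rsc_command",
)


def crmd_events(lines, daemon="crmd", funcs=DEFAULT_FUNCS):
    """Yield (timestamp, function, message) for matching log lines."""
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            # Lines such as "openais[16486]: [TOTEM] ..." use another
            # layout and are skipped.
            continue
        if m["daemon"] == daemon and m["func"] in funcs:
            yield m["ts"], m["func"], m["msg"]


if __name__ == "__main__":
    import sys

    # Usage: python filter_log.py < /var/log/messages
    for ts, func, msg in crmd_events(sys.stdin):
        print(ts, func, msg)
```

Run against the OpenAIS excerpt, this should leave the crmd quorum-loss line and the final `abort_transition_graph` line, which is the point where the start of the dummy resource appears to be stopped.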

-----------------------------------------------------------------
Jan 9 13:34:30 ais-1 crmd: [16497]: info: ais_status_callback: status: ais-2 is now lost (was member)
Jan 9 13:34:30 ais-1 crmd: [16497]: info: crm_update_peer: Node ais-2: id=1234 state=lost (new) addr=r(0) ip(192.168.70.60) r(1) ip(192.168.80.61) votes=1 born=3556 seen=3556 proc=00000000000000000000000000053312
Jan 9 13:34:30 ais-1 crmd: [16497]: notice: crm_calculate_quorum: Membership 10: quorum lost
Jan 9 13:34:30 ais-1 crmd: [16497]: info: erase_node_from_join: Removed node ais-2 from join calculations: welcomed=0 itegrated=0 finalized=0 confirmed=1
Jan 9 13:34:30 ais-1 cib: [16493]: info: ais_dispatch: Processing membership 3560
Jan 9 13:34:30 ais-1 cib: [16493]: info: crm_update_peer: Node ais-2: id=1234 state=lost (new) addr=r(0) ip(192.168.70.60) r(1) ip(192.168.80.61) votes=1 born=3556 seen=3556 proc=00000000000000000000000000053312
Jan 9 13:34:30 ais-1 crmd: [16497]: info: crm_update_quorum: Updating quorum status to false (call=53)
Jan 9 13:34:30 ais-1 openais[16486]: [TOTEM] entering GATHER state from 0.
Jan 9 13:34:30 ais-1 cib: [16493]: notice: crm_calculate_quorum: Membership 0: quorum lost
Jan 9 13:34:30 ais-1 openais[16486]: [TOTEM] Creating commit token because I am the rep.
Jan 9 13:34:30 ais-1 openais[16486]: [TOTEM] Saving state aru 94 high seq received 94
Jan 9 13:34:30 ais-1 openais[16486]: [TOTEM] Storing new sequence id for ring de8
Jan 9 13:34:30 ais-1 cib: [16493]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/51): ok (rc=0)
Jan 9 13:34:30 ais-1 openais[16486]: [TOTEM] entering COMMIT state.
Jan 9 13:34:30 ais-1 cib: [16493]: info: cib_config_changed: Attr changes
Jan 9 13:34:30 ais-1 openais[16486]: [TOTEM] entering RECOVERY state.
Jan 9 13:34:30 ais-1 cib: [16493]: info: log_data_element: cib:diff: - <cib have-quorum="1" admin_epoch="0" epoch="1003" num_updates="10" />
Jan 9 13:34:30 ais-1 openais[16486]: [TOTEM] position [0] member 192.168.70.50:
Jan 9 13:34:30 ais-1 cib: [16493]: info: log_data_element: cib:diff: + <cib have-quorum="0" admin_epoch="0" epoch="1004" num_updates="1" />
Jan 9 13:34:30 ais-1 openais[16486]: [TOTEM] previous ring seq 3556 rep 192.168.70.50
Jan 9 13:34:30 ais-1 cib: [16493]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/53): ok (rc=0)
Jan 9 13:34:30 ais-1 openais[16486]: [TOTEM] aru 94 high delivered 94 received flag 1
Jan 9 13:34:30 ais-1 cib: [16493]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/54): ok (rc=0)
Jan 9 13:34:30 ais-1 openais[16486]: [TOTEM] Did not need to originate any messages in recovery.
Jan 9 13:34:30 ais-1 openais[16486]: [TOTEM] Sending initial ORF token
Jan 9 13:34:30 ais-1 crmd: [16497]: info: abort_transition_graph: need_abort:60 - Triggered transition abort (complete=0) : Non-status change
-----------------------------------------------------------------

Best Regards,
Hideo Yamauchi.

--- Andrew Beekhof <beek...@gmail.com> wrote:

> On Wed, Jan 14, 2009 at 09:59, <renayama19661...@ybb.ne.jp> wrote:
> > Hi,
> >
> >> > 1)I make it the state that a resource starts in a standby node.
> >> > 2)I change it so that a stop error occurs in a dummy resource.
> >> > 3)I generate the monitor error of the dummy resource in a standby
> >> > node.
> >> > 4)After a stop error, STONITH is carried out by a partner node.
> >> > 5)Keep STONITH from a standby node waiting.
> >> > 6)While STONITH is not completed, I reboot a standby node.
> >>
> >> Is this in a two-node cluster?
> > Yes.
> >
> >> > Though STONITH from a DC node does not succeed, a resource is started.
> >> > When STONITH did not succeed, the resource was not started at a
> >> > non-DC node.
> >>
> >> I don't understand what you're saying here.
> >> The first statement says a resource was started and the second says it
> >> wasn't... they can't both be true.
> >
> > I'm sorry.
> > It caused misunderstanding.
> >
> > It is time when STONITH is carried out in the environment of two nodes by a
> > standby node.
> >
> > A resource is started without waiting for completion of STONITH from a DC
> > node.
> > While STONITH is not completed, this problem happens if an active node fell.
>
> So let me see if I understand this correctly...
>
> You start with two healthy nodes.
>
> You cause a resource on A to fail, at which point B tries to shoot it.
>
> The stonith op never completes and before it times out, you restart B.
>
> Resources get started on B.
>
> Questions:
>
> Is the above accurate?
> Is only the dummy resource started, or are other ones started too?
> When B comes up again, does it form a two-node cluster with A?
> Is A still up or has it become the DC and shot itself?
>
> >
> > I confirmed the same confirmation based on OpenAIS.
> > However, in OpenAIS, the same problem did not occur.
> > In OpenAIS, the start of the resource is evaded well.
>
> Sorry, parsing error... I can't tell if you're saying the problem also
> exists for clusters based on OpenAIS.
> I think you're saying it does not happen if you use OpenAIS instead of
> Heartbeat.
>
> >
> > --- Andrew Beekhof <beek...@gmail.com> wrote:
> >
> >>
> >> On Jan 14, 2009, at 2:52 AM, <renayama19661...@ybb.ne.jp> wrote:
> >>
> >> > Hi,
> >> >
> >> > About movement of STONITH, I tested it.
> >> > (heartbeat 2.99.2 + Pacemaker-1-0-6fd0eebd186e.tar.gz on
> >> > RHEL5.2(i386VM))
> >> >
> >> > When what I confirmed carries out STONITH from a DC node and a
> >> > non-DC node.
> >> >
> >> > I confirmed it in the next flow.
> >> >
> >> > 1)I make it the state that a resource starts in a standby node.
> >> > 2)I change it so that a stop error occurs in a dummy resource.
> >> > 3)I generate the monitor error of the dummy resource in a standby
> >> > node.
> >> > 4)After a stop error, STONITH is carried out by a partner node.
> >> > 5)Keep STONITH from a standby node waiting.
> >> > 6)While STONITH is not completed, I reboot a standby node.
> >>
> >> Is this in a two-node cluster?
> >>
> >> > I watched log.
> >> >
> >> > Though STONITH from a DC node does not succeed, a resource is started.
> >> > When STONITH did not succeed, the resource was not started at a
> >> > non-DC node.
> >>
> >> I don't understand what you're saying here.
> >> The first statement says a resource was started and the second says it
> >> wasn't... they can't both be true.
> >>
> >> >
> >> > ---------------------------------------------------------------------------
> >> > Jan 13 16:01:25 ais-1 crmd: [6003]: info: send_rsc_command:
> >> > Initiating action 7: start
> >> > prmDummy1_start_0 on ais-1
> >> > ---------------------------------------------------------------------------
> >> >
> >> > When STONITH did not succeed, I thought that the resource did not
> >> > start.
> >> > Does not the behavior when STONITH failed from a DC node have a
> >> > problem?
> >> >
> >> > I attach a result of hb_report.
> >> > - stonith_exec_dc.tar.gz (A result when STONITH was carried out by a
> >> > DC node(ais-1))
> >> > - stonith_exec_nodc.tar.gz(A result when STONITH was carried out by
> >> > a non-DC node(ais-1))
>
> _______________________________________________
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker