On Jan 15, 2009, at 1:55 AM, <renayama19661...@ybb.ne.jp> <renayama19661...@ybb.ne.jp> wrote:
> Hi Andrew,
>
>>> It is time when STONITH is carried out in the environment of two
>>> nodes by a standby node.
>>>
>>> A resource is started without waiting for completion of STONITH
>>> from a DC node.
>>> While STONITH is not completed, this problem happens if an active
>>> node fell.
>>
>> So let me see if I understand this correctly...
>>
>> You start with two healthy nodes.
> Yes.
>
>>
>> You cause a resource on A to fail, at which point B tries to shoot it.
> Yes.
>
>>
>> The stonith op never completes and before it times out, you restart B.
> No.
> It is node A to reboot.
> - Node A is the one that node B is going to shoot.

Ah! Can you log a bug for this please?

>
>
>>
>> Resources get started on B.
> Yes.
> A dummy resource is started at the time of DC node B.
> When node B is not DC, it is not started.
>
>>
>> Questions:
>>
>> Is the above accurate?
>> Is only the dummy resource started, or are other ones started too?
> Yes.

There were two alternatives in that question, the answer can't be "yes" :)

>
>
>> When B comes up again, does it form a two-node cluster with A?
>> Is A still up or has it become the DC and shot itself?
> I do not confirm the state after node A rebooted.
>
>> Sorry, parsing error... I can't tell if you're saying the problem also
>> exists for clusters based on OpenAIS.
>> I think you're saying it does not happen if you use OpenAIS instead of
>> Heartbeat.
> Yes.
> The same problem does not occur in OpenAIS.

Excellent

>
>
> In OpenAIS, transition of the start of the dummy resource seems to
> be stopped after a partner node disappeared.
>
> -----------------------------------------------------------------
> Jan 9 13:34:30 ais-1 crmd: [16497]: info: ais_status_callback: status: ais-2 is now lost (was member)
> Jan 9 13:34:30 ais-1 crmd: [16497]: info: crm_update_peer: Node ais-2: id=1234 state=lost (new) addr=r(0) ip(192.168.70.60) r(1) ip(192.168.80.61) votes=1 born=3556 seen=3556 proc=00000000000000000000000000053312
> Jan 9 13:34:30 ais-1 crmd: [16497]: notice: crm_calculate_quorum: Membership 10: quorum lost
> Jan 9 13:34:30 ais-1 crmd: [16497]: info: erase_node_from_join: Removed node ais-2 from join calculations: welcomed=0 itegrated=0 finalized=0 confirmed=1
> Jan 9 13:34:30 ais-1 cib: [16493]: info: ais_dispatch: Processing membership 3560
> Jan 9 13:34:30 ais-1 cib: [16493]: info: crm_update_peer: Node ais-2: id=1234 state=lost (new) addr=r(0) ip(192.168.70.60) r(1) ip(192.168.80.61) votes=1 born=3556 seen=3556 proc=00000000000000000000000000053312
> Jan 9 13:34:30 ais-1 crmd: [16497]: info: crm_update_quorum: Updating quorum status to false (call=53)
> Jan 9 13:34:30 ais-1 openais[16486]: [TOTEM] entering GATHER state from 0.
> Jan 9 13:34:30 ais-1 cib: [16493]: notice: crm_calculate_quorum: Membership 0: quorum lost
> Jan 9 13:34:30 ais-1 openais[16486]: [TOTEM] Creating commit token because I am the rep.
> Jan 9 13:34:30 ais-1 openais[16486]: [TOTEM] Saving state aru 94 high seq received 94
> Jan 9 13:34:30 ais-1 openais[16486]: [TOTEM] Storing new sequence id for ring de8
> Jan 9 13:34:30 ais-1 cib: [16493]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/51): ok (rc=0)
> Jan 9 13:34:30 ais-1 openais[16486]: [TOTEM] entering COMMIT state.
> Jan 9 13:34:30 ais-1 cib: [16493]: info: cib_config_changed: Attr changes
> Jan 9 13:34:30 ais-1 openais[16486]: [TOTEM] entering RECOVERY state.
> Jan 9 13:34:30 ais-1 cib: [16493]: info: log_data_element: cib:diff: - <cib have-quorum="1" admin_epoch="0" epoch="1003" num_updates="10" />
> Jan 9 13:34:30 ais-1 openais[16486]: [TOTEM] position [0] member 192.168.70.50:
> Jan 9 13:34:30 ais-1 cib: [16493]: info: log_data_element: cib:diff: + <cib have-quorum="0" admin_epoch="0" epoch="1004" num_updates="1" />
> Jan 9 13:34:30 ais-1 openais[16486]: [TOTEM] previous ring seq 3556 rep 192.168.70.50
> Jan 9 13:34:30 ais-1 cib: [16493]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/53): ok (rc=0)
> Jan 9 13:34:30 ais-1 openais[16486]: [TOTEM] aru 94 high delivered 94 received flag 1
> Jan 9 13:34:30 ais-1 cib: [16493]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/54): ok (rc=0)
> Jan 9 13:34:30 ais-1 openais[16486]: [TOTEM] Did not need to originate any messages in recovery.
> Jan 9 13:34:30 ais-1 openais[16486]: [TOTEM] Sending initial ORF token
> Jan 9 13:34:30 ais-1 crmd: [16497]: info: abort_transition_graph: need_abort:60 - Triggered transition abort (complete=0) : Non-status change
> -----------------------------------------------------------------
>
> Best Regards,
> Hideo Yamauchi.
>
>
> --- Andrew Beekhof <beek...@gmail.com> wrote:
>
>> On Wed, Jan 14, 2009 at 09:59, <renayama19661...@ybb.ne.jp> wrote:
>>> Hi,
>>>
>>>>> 1)I make it the state that a resource starts in a standby node.
>>>>> 2)I change it so that a stop error occurs in a dummy resource.
>>>>> 3)I generate the monitor error of the dummy resource in a standby node.
>>>>> 4)After a stop error, STONITH is carried out by a partner node.
>>>>> 5)Keep STONITH from a standby node waiting.
>>>>> 6)While STONITH is not completed, I reboot a standby node.
>>>>
>>>> Is this in a two-node cluster?
>>> Yes.
>>>
>>>>> Though STONITH from a DC node does not succeed, a resource is started.
>>>>> When STONITH did not succeed, the resource was not started at a non-DC node.
>>>>
>>>> I don't understand what you're saying here.
>>>> The first statement says a resource was started and the second says it
>>>> wasn't... they can't both be true.
>>>
>>> I'm sorry.
>>> It caused misunderstanding.
>>>
>>> It is time when STONITH is carried out in the environment of two
>>> nodes by a standby node.
>>>
>>> A resource is started without waiting for completion of STONITH
>>> from a DC node.
>>> While STONITH is not completed, this problem happens if an active
>>> node fell.
>>
>> So let me see if I understand this correctly...
>>
>> You start with two healthy nodes.
>>
>> You cause a resource on A to fail, at which point B tries to shoot it.
>>
>> The stonith op never completes and before it times out, you restart B.
>>
>> Resources get started on B.
>>
>> Questions:
>>
>> Is the above accurate?
>> Is only the dummy resource started, or are other ones started too?
>> When B comes up again, does it form a two-node cluster with A?
>> Is A still up or has it become the DC and shot itself?
>>
>>>
>>> I confirmed the same confirmation based on OpenAIS.
>>> However, in OpenAIS, the same problem did not occur.
>>> In OpenAIS, the start of the resource is evaded well.
>>
>> Sorry, parsing error... I can't tell if you're saying the problem also
>> exists for clusters based on OpenAIS.
>> I think you're saying it does not happen if you use OpenAIS instead of
>> Heartbeat.
>>
>>>
>>> --- Andrew Beekhof <beek...@gmail.com> wrote:
>>>
>>>>
>>>> On Jan 14, 2009, at 2:52 AM, <renayama19661...@ybb.ne.jp> <renayama19661...@ybb.ne.jp> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> About movement of STONITH, I tested it.
>>>>> (heartbeat 2.99.2 + Pacemaker-1-0-6fd0eebd186e.tar.gz on RHEL5.2(i386VM))
>>>>>
>>>>> When what I confirmed carries out STONITH from a DC node and a non-DC node.
>>>>>
>>>>> I confirmed it in the next flow.
>>>>>
>>>>> 1)I make it the state that a resource starts in a standby node.
>>>>> 2)I change it so that a stop error occurs in a dummy resource.
>>>>> 3)I generate the monitor error of the dummy resource in a standby node.
>>>>> 4)After a stop error, STONITH is carried out by a partner node.
>>>>> 5)Keep STONITH from a standby node waiting.
>>>>> 6)While STONITH is not completed, I reboot a standby node.
>>>>
>>>> Is this in a two-node cluster?
>>>>
>>>>> I watched log.
>>>>
>>>>>
>>>>> Though STONITH from a DC node does not succeed, a resource is started.
>>>>> When STONITH did not succeed, the resource was not started at a non-DC node.
>>>>
>>>> I don't understand what you're saying here.
>>>> The first statement says a resource was started and the second says it
>>>> wasn't... they can't both be true.
>>>>
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------------
>>>>> Jan 13 16:01:25 ais-1 crmd: [6003]: info: send_rsc_command: Initiating action 7: start prmDummy1_start_0 on ais-1
>>>>> ---------------------------------------------------------------------------
>>>>>
>>>>> When STONITH did not succeed, I thought that the resource did not start.
>>>>> Does not the behavior when STONITH failed from a DC node have a problem?
>>>>>
>>>>> I attach a result of hb_report.
>>>>> - stonith_exec_dc.tar.gz (A result when STONITH was carried out by a DC node(ais-1))
>>>>> - stonith_exec_nodc.tar.gz (A result when STONITH was carried out by a non-DC node(ais-1))
>>
>> _______________________________________________
>> Pacemaker mailing list
>> Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
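
For anyone comparing the Heartbeat and OpenAIS logs from this thread, here is a rough sketch of a helper that flags resource starts initiated after a peer was marked lost. It is only an illustration: the function name and both regular expressions are my own, written against the two log excerpts quoted above, and other Pacemaker versions or the Heartbeat stack may word these messages differently. The sample lines are stitched together from the two separate excerpts purely to exercise the function; they are not a real log sequence.

```python
import re

# Patterns modelled on the log lines quoted in this thread.
# Assumption: message wording may differ in other versions or stacks.
LOST_RE = re.compile(r"crm_update_peer: Node (\S+): .*state=lost")
START_RE = re.compile(
    r"send_rsc_command: Initiating action \d+: start (\S+) on (\S+)"
)


def starts_after_peer_loss(lines):
    """Return (action, node) pairs for starts logged after a peer was marked lost."""
    lost_seen = False
    hits = []
    for line in lines:
        if LOST_RE.search(line):
            lost_seen = True
        match = START_RE.search(line)
        if match and lost_seen:
            hits.append((match.group(1), match.group(2)))
    return hits


# Hypothetical input: one line from each excerpt, joined for demonstration only.
sample = [
    "Jan 9 13:34:30 ais-1 crmd: [16497]: info: crm_update_peer: "
    "Node ais-2: id=1234 state=lost (new)",
    "Jan 13 16:01:25 ais-1 crmd: [6003]: info: send_rsc_command: "
    "Initiating action 7: start prmDummy1_start_0 on ais-1",
]

if __name__ == "__main__":
    for action, node in starts_after_peer_loss(sample):
        print(f"start of {action} on {node} was initiated after a peer was lost")
```

A hit does not prove the bug on its own, since a start after a completed STONITH is legitimate; it only narrows down which part of the hb_report logs to read by hand.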