On Thu, Feb 23, 2012 at 10:37 PM, Philippe Carbonnier <philippe.carbonn...@vif.fr> wrote: > Hello Andreas and everybody, > > Even with the new timeout : totem token: 5000, sometimes, when jboss1 goes > standby, I've this errors : cib: [12758]: ERROR: send_ais_text: Sending > message 194: FAILED (rc=2): Library error: Connection timed out (110) and > after cib: [12758]: ERROR: send_ais_message: Not connected to AIS
Any openais logs from about that time? > > > Feb 22 14:13:45 ujboss1 crmd: [12762]: info: run_graph: > ==================================================== > Feb 22 14:13:45 ujboss1 crmd: [12762]: notice: run_graph: Transition 3285 > (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, > Source=/var/lib/pengine/pe-input-11116.bz2): Complete > Feb 22 14:13:45 ujboss1 crmd: [12762]: info: te_graph_trigger: Transition > 3285 is now complete > Feb 22 14:13:45 ujboss1 crmd: [12762]: info: notify_crmd: Transition 3285 > status: done - <null> > Feb 22 14:13:45 ujboss1 crmd: [12762]: info: do_state_transition: State > transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS > cause=C_FSA_INTERNAL origin=notify_crmd ] > Feb 22 14:13:45 ujboss1 crmd: [12762]: info: do_state_transition: Starting > PEngine Recheck Timer > Feb 22 14:13:45 ujboss1 pengine: [12761]: info: process_pe_message: > Transition 3285: PEngine Input stored in: > /var/lib/pengine/pe-input-11116.bz2 > Feb 22 14:17:50 ujboss1 cib: [12758]: info: cib_stats: Processed 1 > operations (0.00us average, 0% utilization) in the last 10min > Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: - > <cib admin_epoch="0" epoch="291" num_updates="1" > > Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: - > <configuration > > Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: > - <nodes > > Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: > - <node id="ujboss1" > > Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: > - <instance_attributes id="nodes-ujboss1" > > Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: > - <nvpair value="off" id="nodes-ujboss1-standby" /> > Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: > - </instance_attributes> > Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: > - </node> > Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: > - </nodes> > Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: - > </configuration> > Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: - > </cib> > Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: + > <cib admin_epoch="0" epoch="292" num_updates="1" > > Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: + > <configuration > > Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: > + <nodes > > Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: > + <node id="ujboss1" > > Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: > + <instance_attributes id="nodes-ujboss1" > > Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: > + <nvpair value="on" id="nodes-ujboss1-standby" /> > Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: > + </instance_attributes> > Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: > + </node> > Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: > + </nodes> > Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: + > </configuration> > Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: + > </cib> > Feb 22 14:17:56 ujboss1 cib: [12758]: info: cib_process_request: Operation > complete: op cib_modify for section nodes (origin=local/crm_attribute/4, > version=0.292.1): ok (rc=0) > Feb 22 14:17:56 ujboss1 crmd: [12762]: info: abort_transition_graph: > need_abort:59 - Triggered transition abort (complete=1) : Non-status change > Feb 22 14:17:56 ujboss1 crmd: [12762]: info: need_abort: Aborting on change > to admin_epoch > Feb 22 14:17:56 ujboss1 crmd: [12762]: info: do_state_transition: State > transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL > origin=abort_transition_graph ] > Feb 22 14:17:56 ujboss1 crmd: [12762]: info: do_state_transition: All 2 > cluster nodes are eligible to run resources. > Feb 22 14:17:56 ujboss1 crmd: [12762]: info: do_pe_invoke: Query 3407: > Requesting the current CIB: S_POLICY_ENGINE > Feb 22 14:17:58 ujboss1 cib: [12758]: ERROR: send_ais_text: Sending message > 194: FAILED (rc=2): Library error: Connection timed out (110) > Feb 22 14:17:59 ujboss1 crmd: [12762]: info: do_pe_invoke_callback: Invoking > the PE: query=3407, ref=pe_calc-dc-1329916679-3431, seq=604, quorate=1 > Feb 22 14:17:59 ujboss1 pengine: [12761]: notice: unpack_config: On loss of > CCM Quorum: Ignore > Feb 22 14:17:59 ujboss1 pengine: [12761]: info: unpack_config: Node scores: > 'red' = -INFINITY, 'yellow' = 0, 'green' = 0 > Feb 22 14:17:59 ujboss1 pengine: [12761]: info: determine_online_status: > Node ujboss2 is online > Feb 22 14:17:59 ujboss1 pengine: [12761]: info: unpack_status: Node ujboss1 > is in standby-mode > Feb 22 14:17:59 ujboss1 pengine: [12761]: info: determine_online_status: > Node ujboss1 is standby > Feb 22 14:17:59 ujboss1 pengine: [12761]: notice: group_print: Resource > Group: vifGroup > Feb 22 14:17:59 ujboss1 pengine: [12761]: notice: native_print: > clusterIP (ocf::heartbeat:IPaddr2): Started ujboss1 > Feb 22 14:17:59 ujboss1 pengine: [12761]: notice: native_print: > routing-jboss (lsb:routing-jboss): Started ujboss1 > Feb 22 14:17:59 ujboss1 pengine: [12761]: notice: RecurringOp: Start > recurring monitor (30s) for clusterIP on ujboss2 > Feb 22 14:17:59 ujboss1 pengine: [12761]: notice: RecurringOp: Start > recurring monitor (30s) for routing-jboss on ujboss2 > Feb 22 14:17:59 ujboss1 pengine: [12761]: notice: LogActions: Move resource > clusterIP (Started ujboss1 -> ujboss2) > Feb 22 14:17:59 ujboss1 pengine: [12761]: notice: LogActions: Move resource > routing-jboss (Started ujboss1 -> ujboss2) > Feb 22 14:18:00 ujboss1 cib: [10032]: info: write_cib_contents: Archived > previous version as /var/lib/heartbeat/crm/cib-56.raw > Feb 22 14:18:00 ujboss1 cib: [10032]: info: write_cib_contents: Wrote > version 0.292.0 of the CIB to disk (digest: > e978102f42a9b6512133008fe219e5a4) > Feb 22 14:18:00 ujboss1 cib: [10032]: info: retrieveCib: Reading cluster > configuration from: /var/lib/heartbeat/crm/cib.2Gj3jp (digest: > /var/lib/heartbeat/crm/cib.ipauMZ) > Feb 22 14:18:00 ujboss1 crmd: [12762]: info: do_state_transition: State > transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS > cause=C_IPC_MESSAGE origin=handle_response ] > Feb 22 14:18:00 ujboss1 crmd: [12762]: info: unpack_graph: Unpacked > transition 3286: 11 actions in 11 synapses > Feb 22 14:18:00 ujboss1 crmd: [12762]: info: do_te_invoke: Processing graph > 3286 (ref=pe_calc-dc-1329916679-3431) derived from > /var/lib/pengine/pe-input-11117.bz2 > Feb 22 14:18:00 ujboss1 crmd: [12762]: info: te_pseudo_action: Pseudo action > 15 fired and confirmed > Feb 22 14:18:00 ujboss1 crmd: [12762]: info: te_rsc_command: Initiating > action 10: stop routing-jboss_stop_0 on ujboss1 (local) > Feb 22 14:18:00 ujboss1 lrmd: [12759]: info: cancel_op: operation > monitor[67] on lsb::routing-jboss::routing-jboss for client 12762, its > parameters: CRM_meta_interval=[30000] CRM_meta_timeout=[20000] > crm_feature_set=[3.0.1] CRM_meta_name=[monitor] cancelled > Feb 22 14:18:00 ujboss1 pengine: [12761]: info: process_pe_message: > Transition 3286: PEngine Input stored in: > /var/lib/pengine/pe-input-11117.bz2 > Feb 22 14:18:00 ujboss1 crmd: [12762]: info: do_lrm_rsc_op: Performing > key=10:3286:0:cd17d699-5ac6-4c00-bc70-f61bba0c140c op=routing-jboss_stop_0 ) > Feb 22 14:18:00 ujboss1 lrmd: [12759]: info: rsc:routing-jboss:68: stop > Feb 22 14:18:00 ujboss1 crmd: [12762]: info: process_lrm_event: LRM > operation routing-jboss_monitor_30000 (call=67, status=1, cib-update=0, > confirmed=true) Cancelled > Feb 22 14:18:00 ujboss1 lrmd: [10035]: WARN: For LSB init script, no > additional parameters are needed. > Feb 22 14:18:00 ujboss1 lrmd: [12759]: info: RA output: > (routing-jboss:stop:stdout) Disabling traffic redirection from 128.1.13.9 to > 128.1.13.7 > Feb 22 14:18:00 ujboss1 lrmd: [12759]: info: RA output: > (routing-jboss:stop:stdout) [ > Feb 22 14:18:00 ujboss1 lrmd: [12759]: info: RA output: > (routing-jboss:stop:stdout) OK > Feb 22 14:18:00 ujboss1 lrmd: [12759]: info: RA output: > (routing-jboss:stop:stdout) ] > Feb 22 14:18:00 ujboss1 lrmd: [12759]: info: RA output: > (routing-jboss:stop:stdout) > Feb 22 14:18:00 ujboss1 lrmd: [12759]: info: RA output: > (routing-jboss:stop:stdout) > > Feb 22 14:18:00 ujboss1 crmd: [12762]: info: process_lrm_event: LRM > operation routing-jboss_stop_0 (call=68, rc=0, cib-update=3408, > confirmed=true) ok > Feb 22 14:18:00 ujboss1 cib: [12758]: ERROR: send_ais_message: Not connected > to AIS > Feb 22 14:18:00 ujboss1 crmd: [12762]: info: match_graph_event: Action > routing-jboss_stop_0 (10) confirmed on ujboss1 (rc=0) > Feb 22 14:18:00 ujboss1 crmd: [12762]: info: te_rsc_command: Initiating > action 7: stop clusterIP_stop_0 on ujboss1 (local) > Feb 22 14:18:00 ujboss1 lrmd: [12759]: info: cancel_op: operation > monitor[65] on ocf::IPaddr2::clusterIP for client 12762, its parameters: > CRM_meta_interval=[30000] ip=[128.1.13.9] cidr_netmask=[32] > CRM_meta_timeout=[20000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor] > iflabel=[jbossfailover] cancelled > Feb 22 14:18:00 ujboss1 crmd: [12762]: info: do_lrm_rsc_op: Performing > key=7:3286:0:cd17d699-5ac6-4c00-bc70-f61bba0c140c op=clusterIP_stop_0 ) > Feb 22 14:18:00 ujboss1 lrmd: [12759]: info: rsc:clusterIP:69: stop > Feb 22 14:18:00 ujboss1 crmd: [12762]: info: process_lrm_event: LRM > operation clusterIP_monitor_30000 (call=65, status=1, cib-update=0, > confirmed=true) Cancelled > Feb 22 14:18:00 ujboss1 lrmd: [12759]: info: RA output: > (clusterIP:stop:stderr) logger: unknown facility name: none. > > Feb 22 14:18:00 ujboss1 lrmd: [12759]: info: RA output: > (clusterIP:stop:stderr) logger: unknown facility name: none. > > Feb 22 14:18:00 ujboss1 crmd: [12762]: info: process_lrm_event: LRM > operation clusterIP_stop_0 (call=69, rc=0, cib-update=3409, confirmed=true) > ok > Feb 22 14:18:00 ujboss1 cib: [12758]: ERROR: send_ais_message: Not connected > to AIS > Feb 22 14:18:00 ujboss1 crmd: [12762]: info: match_graph_event: Action > clusterIP_stop_0 (7) confirmed on ujboss1 (rc=0) > Feb 22 14:18:00 ujboss1 crmd: [12762]: info: te_pseudo_action: Pseudo action > 16 fired and confirmed > Feb 22 14:18:00 ujboss1 crmd: [12762]: info: te_pseudo_action: Pseudo action > 3 fired and confirmed > Feb 22 14:18:00 ujboss1 crmd: [12762]: info: te_pseudo_action: Pseudo action > 13 fired and confirmed > Feb 22 14:18:00 ujboss1 crmd: [12762]: info: te_rsc_command: Initiating > action 8: start clusterIP_start_0 on ujboss2 > Feb 22 14:18:01 ujboss1 cib: [12758]: ERROR: send_ais_message: Not connected > to AIS > Feb 22 14:18:01 ujboss1 crmd: [12762]: info: match_graph_event: Action > clusterIP_start_0 (8) confirmed on ujboss2 (rc=0) > Feb 22 14:18:01 ujboss1 crmd: [12762]: info: te_rsc_command: Initiating > action 9: monitor clusterIP_monitor_30000 on ujboss2 > Feb 22 14:18:01 ujboss1 crmd: [12762]: info: te_rsc_command: Initiating > action 11: start routing-jboss_start_0 on ujboss2 > Feb 22 14:18:01 ujboss1 cib: [12758]: ERROR: send_ais_message: Not connected > to AIS > > > > > Le 19/01/2012 17:38, Philippe Carbonnier a écrit : > > Hello Andreas, all, > > many thanks for your help. I have change my configurations, and will take > you informed. > Best regards, > Philippe > > Le 19/01/2012 14:38, Andreas Kurz a écrit : > > Hello, > > On 01/19/2012 12:21 PM, Philippe Carbonnier wrote: > > Hello Andreas, and others, > > I'm sorry not having well explain my problem (I'm french and don't speak > english fluently...). > > np .. I'm also not a native speaker ;-) > > The loop has arrived after putting ujoss1 (not ujboss2) standy, this is > why I put more log for ujboss1 and not ujboss2. > > I'll try to answer your questions : > the network setup is simple : ujboss1 ans ujboss2 are on the same lan, > without firewall between. Perhaps it's not what you was looking for? > > hmm ... looks really like a short split brain ... default timeouts in > corosync are quite short at 1000ms for token lost ... try e.g. "token: > 5000" in the "totem" section of your corosync config. > > Regards, > Andreas > > > > _______________________________________________ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org > > > _______________________________________________ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org > _______________________________________________ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org