On Mon, May 5, 2014 at 7:32 PM, Andrew Beekhof <and...@beekhof.net> wrote:
> On 3 May 2014, at 6:20 am, Radoslaw Garbacz <radoslaw.garb...@xtremedatainc.com> wrote:
>
> > Hi,
> >
> > I have a strange situation, which I would like to ask about, whether it
> > is a bug, misconfiguration or an intended behavior.
>
> Short version: That's not a valid test.
> Medium version: That's not a valid test, and there are updates available
> for pacemaker in el6.
> Long version: Using iptables in this way not only stops the cluster from
> seeing its peer, but also stops the cluster from talking to itself on the
> same node. At which point nothing will work.
>
> Did you configure fencing?

Yes. Thank you for your suggestions and help.

> > A disconnected node does not detect it is lost, and does not perform any
> > actions to stop, even though resource agents report errors when monitored;
> > just the number of processes (of some hung resource agents) keeps growing.
> >
> > Seems like pacemaker ignores timeouts when trying to update CIB.
> >
> > The situation is caused by corosync not detecting lost quorum due to the
> > firewall blocking lo. As far as I checked, this prevents corosync from
> > detecting problems with the cluster, and when lo access is restored
> > everything should be fine, but shouldn't pacemaker detect the lost CIB
> > service and do something about it? Maybe there is a configuration
> > parameter to control this?
> >
> > Technical details:
> >
> > 1)
> > 1.1) machine: Amazon Linux: Linux ... 3.10.35-43.137.amzn1.x86_64 #1 SMP Wed Apr 2 09:36:59 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
> > 1.2) Pacemaker: Pacemaker 1.1.9-1512.el6
> > 1.3) corosync: Corosync Cluster Engine, version '2.3.2'
> >
> > 2) Net: basic: ethx, lo
> >
> > 3) iptables:
> > *filter
> > :INPUT ACCEPT [0:0]
> > :FORWARD ACCEPT [0:0]
> > :OUTPUT ACCEPT [0:0]
> > -A INPUT -p tcp -m tcp -s <my_machine> --dport 22 -j ACCEPT
> > -A INPUT -j DROP
> > -A OUTPUT -p tcp -m tcp -d <my_machine> --sport 22 -j ACCEPT
> > -A OUTPUT -j DROP
> > COMMIT
> >
> > 4) crm config:
> > <crm_config>
> >   <cluster_property_set id="cib-bootstrap-options">
> >     <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
> >     <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="stop"/>
> >     <nvpair id="cib-bootstrap-options-stop-orphan-resources" name="stop-orphan-resources" value="true"/>
> >     <nvpair id="cib-bootstrap-options-start-failure-is-fatal" name="start-failure-is-fatal" value="true"/>
> >     <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="3"/>
> >     <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.9-1512.el6-2a917dd"/>
> >     <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
> >   </cluster_property_set>
> > </crm_config>
> >
> > 5) Example resource config:
> > <primitive class="ocf" id="dbx_ready_nodes" provider="dbxcl" type="ready.ocf.sh">
> >   <instance_attributes id="dbx_ready_nodes-instance_attributes">
> >     <nvpair id="dbx_ready_nodes-instance_attributes-dbxclrole" name="dbxclrole" value="''"/>
> >   </instance_attributes>
> >   <operations>
> >     <op id="dbx_ready_nodes-start-timeout-1min-on-fail-stop" interval="0s" name="start" on-fail="stop" timeout="1min"/>
> >     <op id="dbx_ready_nodes-stop-timeout-8min" interval="0s" name="stop" timeout="8min"/>
> >     <op id="dbx_ready_nodes-monitor-interval-83s" interval="83s" name="monitor" on-fail="stop" timeout="60s"/>
> >     <op id="dbx_ready_nodes-validate-all-interval-29s" interval="29s" name="validate-all" on-fail="stop" timeout="60s"/>
> >   </operations>
> > </primitive>
> >
> > 6) Logs:
> > Below, the monitor action of resource "dbx_ready_nodes" returns an error,
> > but nothing happens: the resource is not requested to stop (even though
> > it should be, as can be seen above).
> >
> > May 02 20:04:13 [16191] ip-10-116-169-85 lrmd: debug: operation_finished: dbx_ready_nodes_monitor_83000:8669 - exited with rc=1
> > May 02 20:04:13 [16191] ip-10-116-169-85 lrmd: debug: log_finished: finished - rsc:dbx_ready_nodes action:monitor call_id:142 pid:8669 exit-code:1 exec-time:0ms queue-time:0ms
> > May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
> > May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
> > May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
> > May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
> > May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
> > May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
> > May 02 20:04:13 [16154] ip-10-116-169-85 corosync warning [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
> >
> > Thanks in advance
> >
> > --
> > Best Regards,
> >
> > Radoslaw Garbacz
> > XtremeData Incorporation
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org

--
Best Regards,

Radoslaw Garbacz
XtremeData Incorporation
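
For reference, regarding the "long version" in the reply: the posted rules drop all traffic on every interface, lo included, so the local daemons can no longer talk to each other. A minimal sketch of a ruleset that cuts the node off from its peers but leaves loopback open might look like the following. It assumes the loopback interface is named lo and keeps the <my_machine> placeholder from the original post; it has not been verified against this cluster.

```
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
# Assumption: loopback stays open so processes on this node can still reach each other
-A INPUT -i lo -j ACCEPT
-A OUTPUT -o lo -j ACCEPT
# Unchanged from the original post: allow SSH from the admin machine, drop the rest
-A INPUT -p tcp -m tcp -s <my_machine> --dport 22 -j ACCEPT
-A INPUT -j DROP
-A OUTPUT -p tcp -m tcp -d <my_machine> --sport 22 -j ACCEPT
-A OUTPUT -j DROP
COMMIT
```

The two loopback rules have to come before the DROP rules, since iptables evaluates a chain in order and stops at the first matching terminating target.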