Andrew,
I apologize for sending my previous email so abruptly.
I have followed your recommendation and installed Pacemaker.
Here is my config.
Packages Installed:
heartbeat-2.99.2-6.1
heartbeat-common-2.99.2-6.1
heartbeat-debug-2.99.2-6.1
heartbeat-ldirectord-2.99.2-6.1
heartbeat-resources-2.99.2-6.1
libheartbeat2-2.99.2-6.1
libpacemaker3-1.0.1-3.1
pacemaker-1.0.1-3.1
pacemaker-debug-1.0.1-3.1
pacemaker-pygui-1.4-11.9
pacemaker-pygui-debug-1.4-11.9
ha.cf:
# Logging
debug 1
use_logd false
logfacility daemon
# Misc Options
traditional_compression off
compression bz2
coredumps true
# Communications
udpport 691
bcast eth1 eth0
autojoin any
# Thresholds (in seconds)
keepalive 1
warntime 6
deadtime 10
initdead 15
ping 10.50.254.254
crm respawn
apiauth mgmtd uid=root
respawn root /usr/lib/heartbeat/mgmtd -v
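(Editorial illustration, not part of the original message: the timer values above can be sanity-checked against the usual ha.cf guidance that the intervals are strictly ordered and that initdead should be at least twice deadtime.)

```python
# Sanity-check the ha.cf timing values quoted above. The "initdead should be
# at least twice deadtime" guideline comes from the ha.cf documentation; this
# snippet is illustrative only.
keepalive, warntime, deadtime, initdead = 1, 6, 10, 15  # seconds, from ha.cf

assert keepalive < warntime < deadtime, "timers should be strictly ordered"

# With deadtime 10, the guideline would suggest initdead >= 20.
meets_guideline = initdead >= 2 * deadtime
print("initdead meets the 2x deadtime guideline:", meets_guideline)
```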
cib.xml:
<cib admin_epoch="0" validate-with="pacemaker-1.0" crm_feature_set="3.0"
have-quorum="1" epoch="57" dc-uuid="5e3e3c2d-55e7-4c51-90be-5c4a1912bf3e"
num_updates="0" cib-last-written="Mon Jan 26 13:57:32 2009">
<configuration>
<crm_config>
<cluster_property_set id="cib-bootstrap-options">
<nvpair id="cib-bootstrap-options-dc-version" name="dc-version"
value="1.0.1-node: 6fc5ce8302abf145a02891ec41e5a492efbe8efe"/>
</cluster_property_set>
</crm_config>
<nodes>
<node id="5e3e3c2d-55e7-4c51-90be-5c4a1912bf3e" uname="nomen.esri.com"
type="normal">
<instance_attributes id="nodes-5e3e3c2d-55e7-4c51-90be-5c4a1912bf3e">
<nvpair id="standby-5e3e3c2d-55e7-4c51-90be-5c4a1912bf3e"
name="standby" value="off"/>
</instance_attributes>
</node>
<node id="27f54ec3-b626-4b4f-b8a6-4ed0b768513c" uname="rubric.esri.com"
type="normal">
<instance_attributes id="nodes-27f54ec3-b626-4b4f-b8a6-4ed0b768513c">
<nvpair id="standby-27f54ec3-b626-4b4f-b8a6-4ed0b768513c"
name="standby" value="off"/>
</instance_attributes>
</node>
</nodes>
<resources>
<group id="Directory_Server">
<meta_attributes id="Directory_Server-meta_attributes">
<nvpair id="Directory_Server-meta_attributes-collocated"
name="collocated" value="true"/>
<nvpair id="Directory_Server-meta_attributes-ordered" name="ordered"
value="true"/>
<nvpair id="Directory_Server-meta_attributes-resource_stickiness"
name="resource_stickiness" value="100"/>
</meta_attributes>
<primitive class="ocf" id="VIP" provider="heartbeat" type="IPaddr">
<instance_attributes id="VIP-instance_attributes">
<nvpair id="VIP-instance_attributes-ip" name="ip"
value="10.50.26.250"/>
</instance_attributes>
<operations id="VIP-ops">
<op id="VIP-monitor-5s" interval="5s" name="monitor" timeout="5s"/>
</operations>
</primitive>
<primitive class="ocf" id="ECAS" provider="esri" type="ecas">
<operations id="ECAS-ops">
<op id="ECAS-monitor-3s" interval="3s" name="monitor" timeout="3s"/>
</operations>
<meta_attributes id="ECAS-meta_attributes">
<nvpair id="ECAS-meta_attributes-target-role" name="target-role"
value="Started"/>
</meta_attributes>
</primitive>
<primitive class="ocf" id="FDS_Admin" provider="esri" type="fdsadm">
<operations id="FDS_Admin-ops">
<op id="FDS_Admin-monitor-3s" interval="3s" name="monitor"
timeout="3s"/>
</operations>
</primitive>
</group>
</resources>
<constraints>
<rsc_location id="cli-prefer-Directory_Server" rsc="Directory_Server">
<rule id="cli-prefer-rule-Directory_Server" score="INFINITY"
boolean-op="and">
<expression id="cli-prefer-expr-Directory_Server" attribute="#uname"
operation="eq" value="rubric.esri.com" type="string"/>
</rule>
</rsc_location>
<rsc_location id="cli-prefer-FDS_Admin" rsc="FDS_Admin">
<rule id="cli-prefer-rule-FDS_Admin" score="INFINITY" boolean-op="and">
<expression id="cli-prefer-expr-FDS_Admin" attribute="#uname"
operation="eq" value="nomen.esri.com" type="string"/>
</rule>
</rsc_location>
</constraints>
</configuration>
</cib>
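(Editorial sketch, not part of the original configuration: in Pacemaker 1.0, the per-resource meta attribute migration-threshold controls how many failures it takes before a resource is moved off a node. Added to the group's existing meta_attributes block, it might look like this; the nvpair id is made up, the value is an example.)

```xml
<!-- Hypothetical addition to Directory_Server-meta_attributes: move the
     group away after a single monitor failure (Pacemaker 1.0 syntax). -->
<nvpair id="Directory_Server-meta_attributes-migration-threshold"
        name="migration-threshold" value="1"/>
```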
I still see the same issues I had when I was running only heartbeat 2.1.3-1.
My concerns remain:
01) When a node comes back up after a restart of heartbeat, resources get
bounced when it rejoins the cluster.
02) Stopping one resource in a group does not fail the group over to the other
node.
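(Editorial illustration: issue 02 is consistent with the score equation quoted further down in this thread; here is a small worked calculation using the numbers from this configuration. The helper function is a toy model, not part of Heartbeat or Pacemaker.)

```python
# Illustrative placement score for the Directory_Server group, using the
# equation from http://www.linux-ha.org/ScoreCalculation quoted later in
# this thread. Toy model only.
def node_score(constraint_score, num_group_resources, resource_stickiness,
               failcount, failure_stickiness):
    return (constraint_score
            + num_group_resources * resource_stickiness
            + failcount * failure_stickiness)

# Three group members (VIP, ECAS, FDS_Admin), stickiness 100, failure
# stickiness -100, one failed resource on the active node:
active = node_score(0, 3, 100, 1, -100)  # 0 + 300 - 100 = 200
other = 0  # stickiness only counts on the node actually running the group

# 200 > 0, so the group stays put; with these numbers the failcount would
# have to reach 3 before the active node's score dropped to zero.
print(active, other)
```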
Help.
Regards,
Jerome
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Andrew Beekhof
Sent: Tuesday, January 20, 2009 1:33 PM
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] Failover not working as I expected
On Tue, Jan 20, 2009 at 21:48, Jerome Yanga <[email protected]> wrote:
> Dominik,
>
> Per your request, attached is my current configuration.
>
> To reiterate, the following are still concerns:
>
> 01) Resources get bounced when Nomen rejoins the cluster.
> 02) Group failover will not work as hoped.
>
> As for resource monitoring, I believe that the customized init scripts are
> working properly; however, me being a noob seems to contradict this. I have
> tested the init scripts such that when the resource fails, the service is
> restarted. After seeing that the init script works, I have set the "On Fail"
> value to "stop" instead of "restart".
>
> Moreover, I have tried varying the group scores by changing the
> resource_stickiness and the resource_failure_stickiness values.
I would highly encourage you to upgrade to the latest stable series of
Pacemaker.
The whole failure stickiness nonsense has been completely dropped in
favor of something that's actually usable.
http://clusterlabs.org/wiki/Install
http://clusterlabs.org/wiki/Documentation <-- look for the 1.0 version
of configuration explained
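(Editorial note: this presumably refers to the migration-threshold meta attribute, which replaces failure stickiness in Pacemaker 1.0. A rough crm-shell sketch, with example values only:)

```shell
# Rough sketch, Pacemaker 1.0 crm shell; values are examples, not advice.
crm configure rsc_defaults resource-stickiness=100
crm configure rsc_defaults migration-threshold=1    # move after one failure
crm resource failcount ECAS show nomen.esri.com     # inspect a failcount
```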
> However, I have not been able to consistently fail the group over by stopping
> one of the resources. During the testing, I have tried using the equation
> below from the site you provided in your previous email.
>
> node = (constraint-score) + (num_group_resources * resource_stickiness) +
> (failcount * (resource_failure_stickiness) )
>
> Unfortunately, the scores do not seem to follow this equation when I
> verify them using showscores.sh. The following values were assigned to the
> Directory_Server group during this testing.
>
> resource_stickiness=100
> resource_failure_stickiness=-500
>
> I have also attempted to use the crm_failcount command to make sure that the
> scores get reset prior to failing any resource, but showscores.sh seems to
> show that the command is not working.
>
> I have also tried to change the cib.xml file manually to assign the values
> above to default-resource-stickiness and default-resource-failure-stickiness
> respectively, but after doing so, all the resources seem to disappear.
> (Good thing I had created a copy of the cib.xml file.)
>
> By the way, I have changed the values back to the following:
>
> resource_stickiness=100
> resource_failure_stickiness=-100
>
> Help.
>
> Regards,
> Jerome
>
>
>
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Dominik Klein
> Sent: Monday, January 19, 2009 11:31 PM
> To: General Linux-HA mailing list
> Subject: Re: [Linux-HA] Failover not working as I expected
>
> Jerome Yanga wrote:
>> Dominik,
>>
>> Thank you very much. Adding "resource-stickiness" and getting rid of the
>> constraint helped a lot. The resources do not go back to Nomen anymore
>> when its heartbeat is started again (the resources stay with Rubric).
>> However, the resources still get bounced once Nomen joins the cluster. Is
>> there any way to keep the resources from bouncing when Nomen rejoins the
>> cluster?
>
> Please share your current configuration.
>
>> I have also observed another issue. As you have seen in my cib.xml, I have
>> created a group called Directory_Server. In this group there are three
>> resources, namely VIP, ECAS and FDS_Admin. If I manually turn off any of
>> these resources, I would like the group resource, Directory_Server, to
>> fail over to the other node. Is there a configuration that will do this?
>> Currently, if one of the three resources goes down, it stays down and the
>> rest continue running. All three resources need to be up and running for
>> our applications to work properly.
>
> Sounds like you're not doing any resource monitoring. Read up on that
> and configure it. The ScoreCalculation page might be handy to understand
> how things work: http://www.linux-ha.org/ScoreCalculation
>
> Regards
> Dominik
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems