Eric Renfro wrote:
Michael Schwartzkopff wrote:
On Saturday, 26 December 2009 11:55:57, Eric Renfro wrote:
Michael Schwartzkopff wrote:
On Saturday, 26 December 2009 11:27:54, Eric Renfro wrote:
Michael Schwartzkopff wrote:
On Saturday, 26 December 2009 10:52:38, Eric Renfro wrote:
Michael Schwartzkopff wrote:
On Saturday, 26 December 2009 08:12:49, Eric Renfro wrote:
Hello,

I'm trying to set up 2 nodes that will run pacemaker with openais as
the communication layer. Ideally I want router1 to be the master node
and to take back over from router2 once it comes back up fully functional.
In my setup, both routers are internet-facing servers; the external
internet IP is toggled to whichever node controls it at the time, and
that node also handles the internal gateway IP that
internal systems route through.

My problem so far is with Route in my setup, and later with getting
shorewall to start/stop on whichever node is active.

Route, in the setup I will show below, fails to start initially;
I presume the internet IP address is not fully initialized at the
time it tries to add the route. If I do a crm resource cleanup failover-gw, it brings the route up just fine. If
I try to move the router_cluster resource from router1 to router2
after it's fully up, it fails because of failover-gw on router2.
Very unlikely. Once the IPaddr2 script finishes, the IP address is up.
Please look for other reasons and grep for "lrm.*failover-gw" in the
logs.
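For example, assuming syslog goes to /var/log/messages (the default on
openSUSE), something like:

grep 'lrm.*failover-gw' /var/log/messages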

Here's my setup at present. For the moment, until I figure out how
to do it, shorewall is started manually. I want to automate this
once the setup is working, though; perhaps you could help me
with that as well.

primitive failover-int-ip ocf:heartbeat:IPaddr2 \
        params ip="192.168.0.1" \
        op monitor interval="2s"
primitive failover-ext-ip ocf:heartbeat:IPaddr2 \
        params ip="24.227.124.158" cidr_netmask="30" broadcast="24.227.124.159" nic="net0" \
        op monitor interval="2s" \
        meta target-role="Started"
primitive failover-gw ocf:heartbeat:Route \
        params destination="0.0.0.0/0" gateway="24.227.124.157" device="net0" \
        meta target-role="Started" \
        op monitor interval="2s"
group router_cluster failover-int-ip failover-ext-ip failover-gw
location router-master router_cluster \
        rule $id="router-master-rule" $role="master" 100: #uname eq router1

I would appreciate as much help as possible. I am fairly new to
pacemaker, but so far all but the Route part of this works well.
Please give us a chance to help you by providing the relevant logs!
Sure..
Here's a big clip of the log, grepped for just failover-gw. Hopefully
this helps; if not, I can pinpoint more of what's happening.
The logs fill up pretty quickly as it comes alive.

messages:Dec 26 02:00:21 router1 pengine: [4724]: info: unpack_rsc_op: failover-gw_monitor_0 on router2 returned 5 (not installed) instead of
the expected value: 7 (not running)
(...)

The rest of the logs is not needed. The first line alone tells you that something is not installed correctly. Please read the lines just
above this one; normally they tell you what is missing.

You could also read through the routing resource agent in
/usr/lib/ocf/resource.d/heartbeat/Route
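For example, you can page through the script directly, or (assuming the
crm shell's ra subcommand is available in this version) list its
parameters with:

less /usr/lib/ocf/resource.d/heartbeat/Route
crm ra info ocf:heartbeat:Route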

Greetings,
Hmmm..
I'm not seeing anything about it. Here's a clip of the lines above, and
one line below the one saying (not installed).

Dec 26 05:00:21 router1 pengine: [4724]: info: determine_online_status:
Node router1 is online
Dec 26 05:00:21 router1 pengine: [4724]: info: unpack_rsc_op:
failover-gw_monitor_0 on router1 returned 0 (ok) instead of the
expected value: 7 (not running)
Dec 26 05:00:21 router1 pengine: [4724]: WARN: unpack_rsc_op: Operation
failover-gw_monitor_0 found resource failover-gw active on router1
Dec 26 05:00:21 router1 pengine: [4724]: info: determine_online_status:
Node router2 is online
Dec 26 05:00:21 router1 pengine: [4724]: info: unpack_rsc_op:
failover-gw_monitor_0 on router2 returned 5 (not installed) instead of
the expected value: 7 (not running)
Dec 26 05:00:21 router1 pengine: [4724]: ERROR: unpack_rsc_op: Hard
error - failover-gw_monitor_0 failed with rc=5: Preventing failover-gw
from re-starting on router2
Hi,

there must be other log entries. In the Route RA, before erroring out, the agent writes the reasons into ocf_log(). What version of pacemaker and
cluster-glue do you have? What distribution are you running?

Greetings,
I've checked all my logs. Syslog logs everything to my messages logfile,
so it should be there if anywhere.

I'm running OpenSUSE 11.2, which comes with heartbeat 2.99.3, pacemaker
1.0.1, and openais 0.80.3; that's what's running in this setup.

Hm. This is already quite an old version of pacemaker, but it should run anyway. Could you please check the resource manually on router1?

export OCF_ROOT=/usr/lib/ocf
export OCF_RESKEY_destination="0.0.0.0/0"
export OCF_RESKEY_gateway="24.227.124.157"

/usr/lib/ocf/resource.d/heartbeat/Route monitor; echo $?
should result in 0 (started) or 7 (not started)

/usr/lib/ocf/resource.d/heartbeat/Route start; echo $?
should add the default route and result in 0

/usr/lib/ocf/resource.d/heartbeat/Route monitor; echo $?
should result in 0 (started)

/usr/lib/ocf/resource.d/heartbeat/Route stop; echo $?
should delete the default route and result in 0

/usr/lib/ocf/resource.d/heartbeat/Route monitor; echo $?
should result in 7 (not started)
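If the resource definition also sets the device parameter, as failover-gw
above does, you probably want to export that as well before testing:

export OCF_RESKEY_device="net0"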

If this does not work as expected, are there any error messages?
Please see if you can debug the Route script.

Greetings,

I did all these tests, and all results came back normal. The first monitor returned 7 (not started); start returned 0 and the next monitor returned 0; stop returned 0, and the monitor after stopping returned 7.

It seems the error for me happens further up, initially, which causes it not to start afterwards. Here's the current setup:

primitive intIP ocf:heartbeat:IPaddr2 \
        params ip="192.168.0.1" cidr_netmask="16" broadcast="192.168.255.255" nic="lan0"
primitive extIP ocf:heartbeat:IPaddr2 \
        params ip="24.227.124.158" cidr_netmask="30" broadcast="24.227.124.159" nic="net0"
primitive resRoute ocf:heartbeat:Route \
        params destination="0.0.0.0/0" gateway="24.227.124.157"
primitive firewall lsb:shorewall
group router_cluster extIP intIP resRoute firewall
location router-master router_cluster \
        rule $id="router-master-rule" $role="master" 100: #uname eq router1

I have added blank lines in the logs to separate out the specific event segments that show it. One in particular, near the top, is what causes the entire resRoute to fail completely:

Dec 27 00:24:40 router2 crmd: [25786]: info: process_lrm_event: LRM operation resRoute_monitor_0 (call=4, rc=5, cib-update=31, confirmed=true) complete not installed
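(As with failover-gw earlier, I can clear the failed probe by hand with

crm resource cleanup resRoute

but it comes back the next time the resource is probed, so that is only a workaround.)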

This is with OpenSUSE 11.1 using the ha-cluster repository, with pacemaker 1.0.5, cluster-glue 1.0, heartbeat 3.0.0, openais 0.80.5, and ha-resources 1.0 (which is the heartbeat 3.99.x stuff, I believe). So fairly current versions now.

I'd been making my setup off of susestudio and hand picking the packages needed.

Any thoughts?

--
Eric Renfro


Aha!
The problem is somewhere in the Route script itself. Doing the same tests you gave earlier, on the very first monitor attempt on Route, while the net0 interface is empty and offline, I get the error that was shown in the previous log snippet:

Route[26705]: ERROR: Gateway address 24.227.124.157 is unreachable.

So the problem is that Route fails with an incorrect error code when it simply can't create the route because the interface is currently offline. It should report 7 (not running), since the route just isn't started.

After looking again at http://hg.linux-ha.org/agents/log/56b9100f9c49/heartbeat/Route, and then finding out that ocf_is_probe was non-existent in my installed version, I looked at http://hg.linux-ha.org/agents/file/56b9100f9c49/heartbeat/.ocf-shellfuncs.in and was able to patch together a fix that worked. The ocf_is_probe from those shellfuncs didn't work as-is because of the -a $OCF_RESKEY_CRM_meta_interval part (it failed with "too many arguments"), but omitting that part entirely resolved the issue overall. It now successfully brings the route up right from the start.
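Roughly, the idea of the fix looks like this (a sketch rather than my exact diff; the __OCF_ACTION variable and the OCF_NOT_RUNNING/OCF_ERR_INSTALLED codes come from the linked .ocf-shellfuncs, and the ip route get test below just stands in for the agent's own gateway check):

# Simplified ocf_is_probe: only test the action name, since including
# "-a $OCF_RESKEY_CRM_meta_interval = 0" broke with "too many arguments"
# when that variable was empty.
ocf_is_probe() {
    [ "$__OCF_ACTION" = "monitor" ]
}

# In the agent, where the gateway is verified before adding the route:
if ! ip route get "$OCF_RESKEY_gateway" >/dev/null 2>&1; then
    if ocf_is_probe; then
        exit $OCF_NOT_RUNNING      # rc=7: interface down, route simply not started
    fi
    ocf_log err "Gateway address $OCF_RESKEY_gateway is unreachable."
    exit $OCF_ERR_INSTALLED        # rc=5: the hard error seen in the pengine logs
fi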


Now that that issue is resolved for the time being....

How would I make it so that, once resRoute's route is taken down on a node when control passes back to the master server, that node activates an alternative route to get itself back online through the 192.168.0.1 gateway? I don't even know where to begin to get this logic in place. All I know is that it has something to do with colocation, but how exactly I'm uncertain. Any advice and examples would be appreciated.
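Would something along these lines be the right direction? (Untested sketch; fallbackRoute and the constraint name are just placeholders I picked. The idea is that the -inf colocation keeps the fallback route off whichever node currently runs router_cluster, so in a two-node cluster it ends up on the standby node.)

primitive fallbackRoute ocf:heartbeat:Route \
        params destination="0.0.0.0/0" gateway="192.168.0.1" \
        op monitor interval="10s"
colocation fallback-avoids-router -inf: fallbackRoute router_cluster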

--
Eric Renfro


_______________________________________________
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
