On 21/05/2013 07:04, Alex Samad - Yieldbroker wrote:
Hi
I have set up a small 2-node cluster that we are using to provide HA for a Java app.
Basically the requirement is to provide HA now, and load balancing later on.
My initial plan was to use:
- 2 Linux nodes
- the iptables CLUSTERIP module to do the load balancing
- cluster software to do the failover
I have left load balancing for now; HA has been given higher priority.
So I am using CentOS 6.3 with the Pacemaker 1.1.7 RPMs.
I have 2 nodes and 1 VIP; the VIP determines which node is the active one.
The application is actually live on both nodes; it's really only the VIP that moves.
I use Pacemaker (1) to ensure the application is running, and (2) to place the VIP on
the right node.
I have created my own resource script:
/usr/lib/ocf/resource.d/yb/ybrp
I used one of the other script files as a starting point, but mine tests:
1) that the application is running, using ps
2) that the application is okay: it can make a call and test the result
Start and stop basically just touch a lock file; monitor does the tests; status uses
the lock file and does the tests as well.
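For illustration, here is a minimal sketch of that logic. The process pattern, health
URL, and lock path are hypothetical, and a real OCF agent also needs meta-data and
validate-all actions, so this is not the real ybrp script:

#!/bin/sh
# Minimal sketch only -- not the real ybrp agent. The process
# pattern, health URL and lock path below are made up.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7
LOCKFILE="/var/lock/ybrp.lock"

app_running() {
    # the 'ps' check: is the Java app process alive?
    # (the [y] trick stops grep matching its own command line)
    ps -ef | grep -q '[y]b-java-app'
}

app_healthy() {
    # make a call against the app and test the result
    curl -sf http://127.0.0.1:8080/ping >/dev/null 2>&1
}

case "$1" in
    start)   touch "$LOCKFILE"; exit $OCF_SUCCESS ;;
    stop)    rm -f "$LOCKFILE"; exit $OCF_SUCCESS ;;
    monitor)
        app_running || exit $OCF_NOT_RUNNING
        app_healthy || exit $OCF_ERR_GENERIC
        exit $OCF_SUCCESS ;;
    status)
        [ -f "$LOCKFILE" ] || exit $OCF_NOT_RUNNING
        app_running || exit $OCF_NOT_RUNNING
        app_healthy || exit $OCF_ERR_GENERIC
        exit $OCF_SUCCESS ;;
    *)       exit $OCF_ERR_GENERIC ;;
esac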
So here is the output from crm configure show:
node dc1wwwrp01
node dc1wwwrp02
primitive ybrpip ocf:heartbeat:IPaddr2 \
params ip="10.32.21.10" cidr_netmask="24" \
op monitor interval="5s"
primitive ybrpstat ocf:yb:ybrp \
op monitor interval="5s"
group ybrp ybrpip ybrpstat
property $id="cib-bootstrap-options" \
dc-version="1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
stonith-enabled="false" \
no-quorum-policy="ignore" \
last-lrm-refresh="1369092192"
Is there anything I should be doing differently? I have seen the colocation option
and something about resource affinity, but I used a group; is that the best-practice
way of doing it?
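(For reference, a group is essentially shorthand for a colocation constraint plus an
ordering constraint on its members, so the explicit crm-syntax equivalent of the group
above would be roughly:

# sketch: explicit equivalent of 'group ybrp ybrpip ybrpstat'
colocation ybrpstat-with-ybrpip inf: ybrpstat ybrpip
order ybrpip-before-ybrpstat inf: ybrpip ybrpstat

A group is usually the simpler spelling when the resources always move together.)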
Probably writing a better OCF script that doesn't rely on 'ps' (eww!)
but on a real LSB 'status'.
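(A sketch of what that could look like, assuming the app writes a pidfile; the path
and function name are hypothetical:

PIDFILE=/var/run/yb-app.pid        # hypothetical pidfile path
ybrp_status() {
    [ -r "$PIDFILE" ] || return 7  # OCF_NOT_RUNNING
    # kill -0 only tests process existence; note it fails with EPERM
    # when the caller is a different user than the app's owner, which
    # would itself show up as a privileges problem
    kill -0 "$(cat "$PIDFILE")" 2>/dev/null || return 7
    return 0                       # OCF_SUCCESS
}
)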
My next step is to add the iptables CLUSTERIP module. It is controlled by a
/proc/.... control file: basically you tell the OS how many nodes there are and which
node number this machine is looking after.
So I was going to make a resource per node number, i.e. the node-1 resource preferring
node 1 and the node-2 resource preferring node 2, so that when one node goes down the
surviving node brings that resource over to itself. That can be done by poking a
number into the /proc file; a sketch of the preferences follows.
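(In crm syntax those per-node preferences might look like this; the primitive names
are hypothetical stand-ins for agents that write the node number into the /proc
control file:

location prefer-node1 clusterip-node1 100: dc1wwwrp01
location prefer-node2 clusterip-node2 100: dc1wwwrp02
)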
But I have seen some weird things happen that I can't explain or control.
Sometimes things go a bit off: when I do a
/usr/sbin/crm_mon -1
I can see the resources have errors next to them, and a message along the lines of:
operation monitor failed 'insufficient privileges' (rc=4)
I normally just do a
crm resource cleanup ybrpstat
and things come back to normal, but I need to understand how it gets into that state,
why, and what I can do to stop it.
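(For reference, rc=4 is OCF_ERR_PERM in the OCF return-code table, which Pacemaker
renders as 'insufficient privileges'; something like this should find every place the
agent can produce it:

grep -nE 'OCF_ERR_PERM|(return|exit) +4' /usr/lib/ocf/resource.d/yb/ybrp
)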
Maybe the user that runs Pacemaker has the right to execute your 'ybrp'
OCF script but not the 'ps' that you use in it?
It returns 4 (!= 0); it should be quite easy to read
/usr/lib/ocf/resource.d/yb/ybrp and search for what would return 4 in the
"status" section.
This is from /var/log/messages.
From node1
==========
May 21 09:02:35 dc1wwwrp01 cib[2351]: info: cib_stats: Processed 1
operations (0.00us average, 0% utilization) in the last 10min
May 21 09:09:28 dc1wwwrp01 crmd[2356]: info: crm_timer_popped: PEngine
Recheck Timer (I_PE_CALC) just popped (900000ms)
May 21 09:09:28 dc1wwwrp01 crmd[2356]: notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED
origin=crm_timer_popped ]
May 21 09:09:28 dc1wwwrp01 crmd[2356]: info: do_state_transition:
Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
May 21 09:09:28 dc1wwwrp01 pengine[2355]: notice: unpack_config: On loss of
CCM Quorum: Ignore
May 21 09:09:28 dc1wwwrp01 pengine[2355]: error: unpack_rsc_op: Preventing
ybrpstat from re-starting on dc1wwwrp01: operation monitor failed 'insufficient
privileges' (rc=4)
May 21 09:09:28 dc1wwwrp01 pengine[2355]: warning: unpack_rsc_op: Processing
failed op ybrpstat_last_failure_0 on dc1wwwrp01: insufficient privileges (4)
May 21 09:09:28 dc1wwwrp01 pengine[2355]: error: unpack_rsc_op: Preventing
ybrpstat from re-starting on dc1wwwrp02: operation monitor failed 'insufficient
privileges' (rc=4)
May 21 09:09:28 dc1wwwrp01 pengine[2355]: warning: unpack_rsc_op: Processing
failed op ybrpstat_last_failure_0 on dc1wwwrp02: insufficient privileges (4)
May 21 09:09:28 dc1wwwrp01 pengine[2355]: notice: common_apply_stickiness:
ybrpstat can fail 999999 more times on dc1wwwrp01 before being forced off
May 21 09:09:28 dc1wwwrp01 pengine[2355]: notice: common_apply_stickiness:
ybrpstat can fail 999999 more times on dc1wwwrp02 before being forced off
May 21 09:09:28 dc1wwwrp01 pengine[2355]: notice: process_pe_message:
Transition 5487: PEngine Input stored in: /var/lib/pengine/pe-input-1485.bz2
May 21 09:09:28 dc1wwwrp01 crmd[2356]: notice: do_state_transition: State
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
cause=C_IPC_MESSAGE origin=handle_response ]
May 21 09:09:28 dc1wwwrp01 crmd[2356]: info: do_te_invoke: Processing graph
5487 (ref=pe_calc-dc-1369091368-5548) derived from
/var/lib/pengine/pe-input-1485.bz2
May 21 09:09:28 dc1wwwrp01 crmd[2356]: notice: run_graph: ==== Transition
5487 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pengine/pe-input-1485.bz2): Complete
May 21 09:09:28 dc1wwwrp01 crmd[2356]: notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
May 21 09:12:35 dc1wwwrp01 cib[2351]: info: cib_stats: Processed 1
operations (0.00us average, 0% utilization) in the last 10min
May 21 09:23:12 dc1wwwrp01 crm_resource[5165]: error: unpack_rsc_op:
Preventing ybrpstat from re-starting on dc1wwwrp01: operation monitor failed
'insufficient privileges' (rc=4)
May 21 09:23:12 dc1wwwrp01 crm_resource[5165]: error: unpack_rsc_op:
Preventing ybrpstat from re-starting on dc1wwwrp02: operation monitor failed
'insufficient privileges' (rc=4)
May 21 09:23:12 dc1wwwrp01 cib[2351]: info: cib_process_request: Operation
complete: op cib_delete for section
//node_state[@uname='dc1wwwrp01']//lrm_resource[@id='ybrpstat']
(origin=local/crmd/5589, version=0.101.36): ok (rc=0)
May 21 09:23:12 dc1wwwrp01 crmd[2356]: info: delete_resource: Removing
resource ybrpstat for 5165_crm_resource (internal) on dc1wwwrp01
May 21 09:23:12 dc1wwwrp01 crmd[2356]: info: notify_deleted: Notifying
5165_crm_resource on dc1wwwrp01 that ybrpstat was deleted
May 21 09:23:12 dc1wwwrp01 crmd[2356]: warning: decode_transition_key: Bad
UUID (crm-resource-5165) in sscanf result (3) for 0:0:crm-resource-5165
May 21 09:23:12 dc1wwwrp01 attrd[2354]: notice: attrd_trigger_update: Sending flush
op to all hosts for: fail-count-ybrpstat (<null>)
May 21 09:23:12 dc1wwwrp01 cib[2351]: info: cib_process_request: Operation
complete: op cib_delete for section
//node_state[@uname='dc1wwwrp01']//lrm_resource[@id='ybrpstat']
(origin=local/crmd/5590, version=0.101.37): ok (rc=0)
May 21 09:23:12 dc1wwwrp01 crmd[2356]: info: abort_transition_graph:
te_update_diff:320 - Triggered transition abort (complete=1, tag=lrm_rsc_op,
id=ybrpstat_last_0, magic=0:0;3:3572:0:c348b36c-f6dd-4a7d-ac5b-01a3b8ce3c34,
cib=0.101.37) : Resource op removal
May 21 09:23:12 dc1wwwrp01 crmd[2356]: info: abort_transition_graph:
te_update_diff:320 - Triggered transition abort (complete=1, tag=lrm_rsc_op,
id=ybrpstat_last_0, magic=0:0;3:3572:0:c348b36c-f6dd-4a7d-ac5b-01a3b8ce3c34,
cib=0.101.37) : Resource op removal
From node2
===========
May 21 09:20:03 dc1wwwrp02 lrmd: [2045]: info: rsc:ybrpip:16: monitor
May 21 09:23:12 dc1wwwrp02 lrmd: [2045]: info: cancel_op: operation monitor[16]
on ocf::IPaddr2::ybrpip for client 2048, its parameters:
CRM_meta_name=[monitor] cidr_netmask=[24] crm_feature_set=[3.0.6]
CRM_meta_timeout=[20000] CRM_meta_interval=[5000] ip=[10.32.21.10] cancelled
May 21 09:23:12 dc1wwwrp02 lrmd: [2045]: info: rsc:ybrpip:20: stop
May 21 09:23:12 dc1wwwrp02 cib[2043]: info: apply_xml_diff: Digest
mis-match: expected dcee73fe6518ac0d4b3429425d5dfc16, calculated
4a39d2ad25d50af2ec19b5b24252aef8
May 21 09:23:12 dc1wwwrp02 cib[2043]: notice: cib_process_diff: Diff 0.101.36
-> 0.101.37 not applied to 0.101.36: Failed application of an update diff
May 21 09:23:12 dc1wwwrp02 cib[2043]: info: cib_server_process_diff:
Requesting re-sync from peer
May 21 09:23:12 dc1wwwrp02 cib[2043]: notice: cib_server_process_diff: Not
applying diff 0.101.36 -> 0.101.37 (sync in progress)
May 21 09:23:12 dc1wwwrp02 cib[2043]: notice: cib_server_process_diff: Not
applying diff 0.101.37 -> 0.102.1 (sync in progress)
May 21 09:23:12 dc1wwwrp02 cib[2043]: notice: cib_server_process_diff: Not
applying diff 0.102.1 -> 0.102.2 (sync in progress)
May 21 09:23:12 dc1wwwrp02 cib[2043]: notice: cib_server_process_diff: Not
applying diff 0.102.2 -> 0.102.3 (sync in progress)
May 21 09:23:12 dc1wwwrp02 cib[2043]: notice: cib_server_process_diff: Not
applying diff 0.102.3 -> 0.102.4 (sync in progress)
Any help or suggestions much appreciated.
Also, fencing!
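(The config above has stonith-enabled="false". Enabling it would look roughly like
this in crm syntax; the fence agent and its parameters are hypothetical and depend
entirely on your hardware:

primitive fence-node1 stonith:fence_ipmilan \
    params ipaddr="<node1-bmc-address>" login="<user>" passwd="<password>" \
    op monitor interval="60s"
property stonith-enabled="true"
)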
Thanks
Alex
--
Cheers,
Florian Crouzat
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org