try crm ra test nginx lb02 start
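If that subcommand isn't available in your crmsh, you can get much the same
information by driving the agent by hand with ocf-tester from resource-agents
(a sketch; the agent path below assumes the stock install location, adjust it
for your distribution):

    # run on each node; uses the same configfile parameter as your primitive
    ocf-tester -n nginx \
        -o configfile=/etc/nginx/nginx.conf \
        /usr/lib/ocf/resource.d/heartbeat/nginx

With the deliberately broken nginx.conf in place, this should surface the same
"not configured" (rc=6) error that shows up as the failed start in your second
log, which there is followed by the fail-count going to INFINITY and the
resource being unable to run anywhere.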
2013/10/22 Lucas Brown <lu...@locatrix.com>

> Hey guys,
>
> I'm encountering a really strange problem testing failover of my
> ocf:heartbeat:nginx resource in my 2-node cluster. I am able to manually
> migrate the resource around the nodes and that works fine, but I can't get
> the resource to run on one node while the other has encountered a failure.
> The strange part is that this only happens if the failure was on node 1;
> if I reproduce the failure on node 2, the resource correctly fails over to
> node 1.
>
> no-quorum-policy is set to "ignore", so that doesn't seem to be the issue,
> and some similar threads mentioned that start-failure-is-fatal=false may
> help, but it doesn't resolve it either. I have a more advanced
> configuration that includes a virtual IP and ping clones, and those parts
> seem to work fine; nginx even fails over correctly when its host goes
> offline completely. I just can't get the same behaviour when only the
> resource fails.
>
> My test case:
>
> >vim /etc/nginx/nginx.conf
> >Insert invalid jargon and save
> >service nginx restart
>
> Expected outcome: the resource fails over to the other node upon monitor
> failure, in either direction between my 2 nodes.
> Actual: the resource fails over correctly from node 2 -> node 1, but not
> from node 1 -> node 2.
>
> This is my test configuration for reproducing the issue (to make sure my
> other stuff isn't interfering):
> -----------------------
> node $id="724150464" lb01
> node $id="740927680" lb02
> primitive nginx ocf:heartbeat:nginx \
>     params configfile="/etc/nginx/nginx.conf" \
>     op monitor interval="10s" timeout="30s" depth="0" \
>     op monitor interval="15s" timeout="30s" status10url="http://localhost/nginx_status" depth="10"
> property $id="cib-bootstrap-options" \
>     dc-version="1.1.10-42f2063" \
>     cluster-infrastructure="corosync" \
>     stonith-enabled="false" \
>     no-quorum-policy="ignore" \
>     start-failure-is-fatal="false" \
>     last-lrm-refresh="1382410708"
> rsc_defaults $id="rsc-options" \
>     resource-stickiness="100"
>
> This is what happens when I perform the test case on node lb02, and it
> correctly migrates/restarts the resource on lb01.
> -----------------------
> Oct 22 11:58:12 [694] lb02 pengine: warning: unpack_rsc_op: Processing failed op monitor for nginx on lb02: not running (7)
> Oct 22 11:58:12 [694] lb02 pengine: info: native_print: nginx (ocf::heartbeat:nginx): Started lb02 FAILED
> Oct 22 11:58:12 [694] lb02 pengine: info: RecurringOp: Start recurring monitor (10s) for nginx on lb02
> Oct 22 11:58:12 [694] lb02 pengine: info: RecurringOp: Start recurring monitor (15s) for nginx on lb02
> Oct 22 11:58:12 [694] lb02 pengine: notice: LogActions: Recover nginx (Started lb02)
> Oct 22 11:58:12 [690] lb02 cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='fail-count-nginx']: No such device or address (rc=-6, origin=local/attrd/1038, version=0.252.2)
> Oct 22 11:58:12 [692] lb02 lrmd: info: cancel_recurring_action: Cancelling operation nginx_monitor_15000
> Oct 22 11:58:12 [692] lb02 lrmd: info: cancel_recurring_action: Cancelling operation nginx_monitor_10000
> Oct 22 11:58:12 [692] lb02 lrmd: info: log_execute: executing - rsc:nginx action:stop call_id:848
> Oct 22 11:58:12 [695] lb02 crmd: info: process_lrm_event: LRM operation nginx_monitor_15000 (call=839, status=1, cib-update=0, confirmed=true) Cancelled
> Oct 22 11:58:12 [695] lb02 crmd: info: process_lrm_event: LRM operation nginx_monitor_10000 (call=841, status=1, cib-update=0, confirmed=true) Cancelled
> Oct 22 11:58:12 [690] lb02 cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='last-failure-nginx']: OK (rc=0, origin=local/attrd/1041, version=0.252.3)
> nginx[31237]: 2013/10/22_11:58:12 INFO: nginx is not running.
> Oct 22 11:58:12 [692] lb02 lrmd: info: log_finished: finished - rsc:nginx action:stop call_id:848 pid:31237 exit-code:0 exec-time:155ms queue-time:0ms
> Oct 22 11:58:12 [695] lb02 crmd: notice: process_lrm_event: LRM operation nginx_stop_0 (call=848, rc=0, cib-update=593, confirmed=true) ok
> Oct 22 11:58:12 [694] lb02 pengine: info: unpack_rsc_op: Operation monitor found resource nginx active on lb01
> Oct 22 11:58:12 [694] lb02 pengine: warning: unpack_rsc_op: Processing failed op monitor for nginx on lb02: not running (7)
> Oct 22 11:58:12 [694] lb02 pengine: info: native_print: nginx (ocf::heartbeat:nginx): Stopped
> Oct 22 11:58:12 [694] lb02 pengine: info: get_failcount_full: nginx has failed 1 times on lb02
> Oct 22 11:58:12 [694] lb02 pengine: info: common_apply_stickiness: nginx can fail 999999 more times on lb02 before being forced off
> Oct 22 11:58:12 [694] lb02 pengine: info: RecurringOp: Start recurring monitor (10s) for nginx on lb01
> Oct 22 11:58:12 [694] lb02 pengine: info: RecurringOp: Start recurring monitor (15s) for nginx on lb01
> Oct 22 11:58:12 [694] lb02 pengine: notice: LogActions: Start nginx (lb01)
>
> This is what happens when I try to go from lb01 -> lb02.
> -----------------------
> Oct 22 12:00:25 [694] lb02 pengine: warning: unpack_rsc_op: Processing failed op monitor for nginx on lb01: not running (7)
> Oct 22 12:00:25 [694] lb02 pengine: info: unpack_rsc_op: Operation monitor found resource nginx active on lb02
> Oct 22 12:00:25 [694] lb02 pengine: info: native_print: nginx (ocf::heartbeat:nginx): Started lb01 FAILED
> Oct 22 12:00:25 [694] lb02 pengine: info: RecurringOp: Start recurring monitor (10s) for nginx on lb01
> Oct 22 12:00:25 [694] lb02 pengine: info: RecurringOp: Start recurring monitor (15s) for nginx on lb01
> Oct 22 12:00:25 [694] lb02 pengine: notice: LogActions: Recover nginx (Started lb01)
> Oct 22 12:00:25 [690] lb02 cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='fail-count-nginx']: No such device or address (rc=-6, origin=local/attrd/1046, version=0.253.12)
> Oct 22 12:00:25 [690] lb02 cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='last-failure-nginx']: OK (rc=0, origin=local/attrd/1047, version=0.253.12)
> Oct 22 12:00:25 [694] lb02 pengine: warning: unpack_rsc_op: Processing failed op monitor for nginx on lb01: not running (7)
> Oct 22 12:00:25 [694] lb02 pengine: info: unpack_rsc_op: Operation monitor found resource nginx active on lb02
> Oct 22 12:00:25 [694] lb02 pengine: info: native_print: nginx (ocf::heartbeat:nginx): Stopped
> Oct 22 12:00:25 [694] lb02 pengine: info: get_failcount_full: nginx has failed 1 times on lb01
> Oct 22 12:00:25 [694] lb02 pengine: info: common_apply_stickiness: nginx can fail 999999 more times on lb01 before being forced off
> Oct 22 12:00:25 [694] lb02 pengine: info: RecurringOp: Start recurring monitor (10s) for nginx on lb01
> Oct 22 12:00:25 [694] lb02 pengine: info: RecurringOp: Start recurring monitor (15s) for nginx on lb01
> Oct 22 12:00:25 [694] lb02 pengine: notice: LogActions: Start nginx (lb01)
> Oct 22 12:00:25 [694] lb02 pengine: error: unpack_rsc_op: Preventing nginx from re-starting anywhere in the cluster : operation start failed 'not configured' (rc=6)
> Oct 22 12:00:25 [694] lb02 pengine: warning: unpack_rsc_op: Processing failed op start for nginx on lb01: not configured (6)
> Oct 22 12:00:25 [694] lb02 pengine: info: unpack_rsc_op: Operation monitor found resource nginx active on lb02
> Oct 22 12:00:25 [694] lb02 pengine: info: native_print: nginx (ocf::heartbeat:nginx): Started lb01 FAILED
> Oct 22 12:00:25 [694] lb02 pengine: info: get_failcount_full: nginx has failed 1 times on lb01
> Oct 22 12:00:25 [694] lb02 pengine: info: common_apply_stickiness: nginx can fail 999999 more times on lb01 before being forced off
> Oct 22 12:00:25 [694] lb02 pengine: info: native_color: Resource nginx cannot run anywhere
> Oct 22 12:00:25 [694] lb02 pengine: notice: LogActions: Stop nginx (lb01)
> Oct 22 12:00:26 [690] lb02 cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='fail-count-nginx']: No such device or address (rc=-6, origin=local/attrd/1049, version=0.253.15)
> Oct 22 12:00:26 [690] lb02 cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='fail-count-nginx']: No such device or address (rc=-6, origin=local/attrd/1050, version=0.253.15)
> Oct 22 12:00:26 [694] lb02 pengine: error: unpack_rsc_op: Preventing nginx from re-starting anywhere in the cluster : operation start failed 'not configured' (rc=6)
> Oct 22 12:00:26 [694] lb02 pengine: warning: unpack_rsc_op: Processing failed op start for nginx on lb01: not configured (6)
> Oct 22 12:00:26 [694] lb02 pengine: info: unpack_rsc_op: Operation monitor found resource nginx active on lb02
> Oct 22 12:00:26 [694] lb02 pengine: info: native_print: nginx (ocf::heartbeat:nginx): Stopped
> Oct 22 12:00:26 [694] lb02 pengine: info: get_failcount_full: nginx has failed INFINITY times on lb01
> Oct 22 12:00:26 [694] lb02 pengine: warning: common_apply_stickiness: Forcing nginx away from lb01 after 1000000 failures (max=1000000)
> Oct 22 12:00:26 [694] lb02 pengine: info: native_color: Resource nginx cannot run anywhere
> Oct 22 12:00:26 [694] lb02 pengine: info: LogActions: Leave nginx (Stopped)
>
> I can't for the life of me work out why this is happening. For whatever
> reason, in the node 1 -> node 2 case it randomly decides that the resource
> can no longer run anywhere.
>
> And yes, I am making sure everything works before I start each test, so
> it's not a failure to run "crm resource cleanup" etc.
>
> Would really appreciate help on this, as I've been trying to debug it for
> a few days and have hit a wall.
>
> Thanks,
>
> Lucas

--
esta es mi vida e me la vivo hasta que dios quiera
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org