[Linux-HA] fence_apc always fails after some time and resources remains stopped

RaSca Fri, 22 Nov 2013 01:27:22 -0800

Hi there,
I'm using Pacemaker 1.1.10 on a Debian cluster of two machines. Both
are connected to an APC power switch, which I can reach from the command
line like this:


# fence_apc -a <APCADDR> -x -l <USER> -p <PASS> -n 1 -o status
Status: ON

I've configured two fence resources for it, each like this:

primitive st_fence_scv1 stonith:fence_apc \
        params ipaddr="<APCADDR>" login="<USER>" passwd="<PASS>" \
        action="reboot" verbose="true" pcmk_host_check="static-list" \
        pcmk_host_list="scv1" secure="true" port="1" \
        op monitor interval="60s"

After a clean start everything works fine. The problem is that the
monitor operation of the resource ALWAYS fails after almost an hour.
In the logs I see this:

Nov 21 21:46:12 [2661] scv1 stonith-ng:     info: stonith_command:
Processed st_execute from lrmd.2662: Operation now in progress (-115)
Nov 21 21:46:12 [2661] scv1 stonith-ng:     info: stonith_action_create:
       Initiating action monitor for agent fence_apc (target=(null))

So the monitor is launched, and then:

Nov 21 21:46:32 [2661] scv1 stonith-ng:     info: st_child_term:
Child 20854 timed out, sending SIGTERM
Nov 21 21:46:32 [2661] scv1 stonith-ng:   notice:
stonith_action_async_done:    Child process 20854 performing action
'monitor' timed out with signal 15
Nov 21 21:46:32 [2661] scv1 stonith-ng:   notice: log_operation:
Operation 'monitor' [20854] for device 'st_fence_scv2' returned: -62
(Timer expired)
Nov 21 21:46:32 [2665] scv1       crmd:    error: process_lrm_event:
LRM operation st_fence_scv2_monitor_60000 (464) Timed Out (timeout=20000ms)

So there is a timeout (which may well be possible, since those APC
devices are very slow).
After this the device is stopped:

Nov 21 21:46:42 [2662] scv1       lrmd:     info: log_execute:
executing - rsc:st_fence_scv2 action:stop call_id:469
Nov 21 21:46:42 [2661] scv1 stonith-ng:     info: stonith_command:
Processed st_device_remove from lrmd.2662: OK (0)

And then restarted:

Nov 21 21:46:42 [2661] scv1 stonith-ng:     info: stonith_action_create:
       Initiating action metadata for agent fence_apc (target=(null))
Nov 21 21:46:42 [2661] scv1 stonith-ng:   notice:
stonith_device_register:      Device 'st_fence_scv2' already existed in
device list (1 active devices)
Nov 21 21:46:42 [2661] scv1 stonith-ng:     info: stonith_command:
Processed st_device_register from lrmd.2662: OK (0)
Nov 21 21:46:42 [2661] scv1 stonith-ng:     info: stonith_command:
Processed st_execute from lrmd.2662: Operation now in progress (-115)
Nov 21 21:46:42 [2661] scv1 stonith-ng:     info: stonith_action_create:
       Initiating action monitor for agent fence_apc (target=(null))

The first thing that I find strange is the "already existed in device
list", but anyway after this the monitor fails again:

Nov 21 21:47:02 [2661] scv1 stonith-ng:     info: st_child_term:
Child 21265 timed out, sending SIGTERM
Nov 21 21:47:02 [2661] scv1 stonith-ng:   notice:
stonith_action_async_done:    Child process 21265 performing action
'monitor' timed out with signal 15
Nov 21 21:47:02 [2661] scv1 stonith-ng:   notice: log_operation:
Operation 'monitor' [21265] for device 'st_fence_scv2' returned: -62
(Timer expired)
...
...
Nov 21 21:47:03 [2661] scv1 stonith-ng:     info: stonith_command:
Processed st_device_remove from lrmd.2662: OK (0)

After this the resource remains in the stopped state. Why does this
happen? Am I hitting the case described in
https://github.com/ClusterLabs/pacemaker/pull/334 ?

What kind of workaround can I use?
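
One possible workaround, untested here and assuming the device is simply
too slow for the timeout in use: the monitor op above sets no timeout,
and the 20000ms in the log suggests a 20s default is being applied. A
sketch of the same primitive with an explicit, longer monitor timeout
(the 60s value is only an illustrative guess, not a measured figure):

primitive st_fence_scv1 stonith:fence_apc \
        params ipaddr="<APCADDR>" login="<USER>" passwd="<PASS>" \
        action="reboot" verbose="true" pcmk_host_check="static-list" \
        pcmk_host_list="scv1" secure="true" port="1" \
        op monitor interval="60s" timeout="60s"

This would not explain why the monitor only starts failing after almost
an hour, so at best it hides the symptom.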

Thanks a lot, as usual.

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
[email protected]
http://www.miamammausalinux.org
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
