Hi there,
I'm using Pacemaker 1.1.10 on a two-node Debian cluster. Both machines
are connected to an APC power switch, which I can reach from the command
line like this:
# fence_apc -a <APCADDR> -x -l <USER> -p <PASS> -n 1 -o status
Status: ON
I've configured two fence resources for it in this way:
primitive st_fence_scv1 stonith:fence_apc \
    params ipaddr="<APCADDR>" login="<USER>" passwd="<PASS>" \
    action="reboot" verbose="true" pcmk_host_check="static-list" \
    pcmk_host_list="scv1" secure="true" port="1" \
    op monitor interval="60s"
After a clean start everything works fine. The problem is that the
monitor operation of the resource ALWAYS fails after almost an hour.
This is what I see in the logs:
Nov 21 21:46:12 [2661] scv1 stonith-ng: info: stonith_command:
Processed st_execute from lrmd.2662: Operation now in progress (-115)
Nov 21 21:46:12 [2661] scv1 stonith-ng: info: stonith_action_create:
Initiating action monitor for agent fence_apc (target=(null))
So the monitor is launched, and then:
Nov 21 21:46:32 [2661] scv1 stonith-ng: info: st_child_term:
Child 20854 timed out, sending SIGTERM
Nov 21 21:46:32 [2661] scv1 stonith-ng: notice:
stonith_action_async_done: Child process 20854 performing action
'monitor' timed out with signal 15
Nov 21 21:46:32 [2661] scv1 stonith-ng: notice: log_operation:
Operation 'monitor' [20854] for device 'st_fence_scv2' returned: -62
(Timer expired)
Nov 21 21:46:32 [2665] scv1 crmd: error: process_lrm_event:
LRM operation st_fence_scv2_monitor_60000 (464) Timed Out (timeout=20000ms)
So there is a timeout (which may well be possible, since these APC
devices are very slow).
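As a quick cross-check of the timestamps, here is a minimal sketch that
computes the gap between "Initiating action monitor" and "timed out" in
the log excerpts. The helper names to_seconds and elapsed_seconds are
made up for this illustration, and it assumes both timestamps fall on
the same day:

```shell
# Convert an HH:MM:SS timestamp to seconds since midnight.
# awk reads the fields as decimal numbers, so "08" or "09" are safe.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Seconds elapsed between two timestamps on the same day
# (hypothetical helper, not part of Pacemaker or fence-agents).
elapsed_seconds() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

elapsed_seconds 21:46:12 21:46:32   # first monitor attempt, prints 20
elapsed_seconds 21:46:42 21:47:02   # retry after the restart, prints 20
```

Both the first attempt and the retry shown further down are cut off
after exactly 20 seconds, which matches the timeout=20000ms in the crmd
line, so the agent seems to be killed while still waiting rather than
failing on its own.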
After the timeout the device is stopped:
Nov 21 21:46:42 [2662] scv1 lrmd: info: log_execute:
executing - rsc:st_fence_scv2 action:stop call_id:469
Nov 21 21:46:42 [2661] scv1 stonith-ng: info: stonith_command:
Processed st_device_remove from lrmd.2662: OK (0)
And then restarted:
Nov 21 21:46:42 [2661] scv1 stonith-ng: info: stonith_action_create:
Initiating action metadata for agent fence_apc (target=(null))
Nov 21 21:46:42 [2661] scv1 stonith-ng: notice:
stonith_device_register: Device 'st_fence_scv2' already existed in
device list (1 active devices)
Nov 21 21:46:42 [2661] scv1 stonith-ng: info: stonith_command:
Processed st_device_register from lrmd.2662: OK (0)
Nov 21 21:46:42 [2661] scv1 stonith-ng: info: stonith_command:
Processed st_execute from lrmd.2662: Operation now in progress (-115)
Nov 21 21:46:42 [2661] scv1 stonith-ng: info: stonith_action_create:
Initiating action monitor for agent fence_apc (target=(null))
The first thing I find strange is the "already existed in device list"
message, but in any case the monitor then fails again:
Nov 21 21:47:02 [2661] scv1 stonith-ng: info: st_child_term:
Child 21265 timed out, sending SIGTERM
Nov 21 21:47:02 [2661] scv1 stonith-ng: notice:
stonith_action_async_done: Child process 21265 performing action
'monitor' timed out with signal 15
Nov 21 21:47:02 [2661] scv1 stonith-ng: notice: log_operation:
Operation 'monitor' [21265] for device 'st_fence_scv2' returned: -62
(Timer expired)
...
...
Nov 21 21:47:03 [2661] scv1 stonith-ng: info: stonith_command:
Processed st_device_remove from lrmd.2662: OK (0)
After this the resource remains in the stopped state. Why does this
happen? Am I hitting the case described in
https://github.com/ClusterLabs/pacemaker/pull/334 ?
What kind of workaround can I use?
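One way to check whether the switch really needs more than 20 seconds
is to run the manual status call under the same time limit. This is
only a sketch: check_within is a hypothetical helper name, it relies on
the timeout command from GNU coreutils, and the manual status call is
not necessarily identical to what the monitor action does internally.

```shell
# Run a command under a time limit (in seconds) and report the outcome.
# Hypothetical helper; relies on `timeout` from GNU coreutils, which
# exits with status 124 when the limit is hit.
check_within() {
    limit="$1"
    shift
    if timeout "$limit" "$@" >/dev/null 2>&1; then
        echo "finished within ${limit}s"
    else
        rc=$?
        if [ "$rc" -eq 124 ]; then
            echo "timed out after ${limit}s"
        else
            echo "failed with exit code $rc"
        fi
    fi
}

# Example against the APC switch, with the same placeholders as above:
# check_within 20 fence_apc -a <APCADDR> -x -l <USER> -p <PASS> -n 1 -o status
```

If the manual call also exceeds the limit, adding an explicit timeout
to the op monitor line (for example timeout="60s", an untested value)
would be the first thing to try.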
Thanks a lot, as usual.
--
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
[email protected]
http://www.miamammausalinux.org