I'm using external/ssh in my test cluster (a bunch of VMs), and for some reason the cluster has tried to fence one of the nodes but failed, like this:
Oct 14 15:54:45 ctest0 stonith-ng: [2006]: info: call_remote_stonith: Requesting that ctest2 perform op off ctest1
Oct 14 15:54:45 ctest0 stonith-ng: [2006]: info: call_remote_stonith: No remaining peers capable of terminating ctest1
Oct 14 15:54:45 ctest0 stonith-ng: [2006]: ERROR: remote_op_done: Already sent notifications for 'off of ctest1 by (null)' (op=6ca32814-1272-482f-bb67-f0b46daef78b, for=d49c9501-bcd3-4563-87a0-303f1b6d4c22, state=1): Operation timed out
Oct 14 15:54:46 ctest0 stonith-ng: [2006]: info: call_remote_stonith: Requesting that ctest2 perform op off ctest1
Oct 14 15:54:46 ctest0 stonith-ng: [2006]: info: call_remote_stonith: No remaining peers capable of terminating ctest1
Oct 14 15:54:46 ctest0 stonith-ng: [2006]: ERROR: remote_op_done: Already sent notifications for 'off of ctest1 by (null)' (op=6ca32814-1272-482f-bb67-f0b46daef78b, for=d49c9501-bcd3-4563-87a0-303f1b6d4c22, state=1): Operation timed out
Oct 14 15:54:47 ctest0 stonith-ng: [2006]: info: call_remote_stonith: Requesting that ctest2 perform op off ctest1
Oct 14 15:54:47 ctest0 stonith-ng: [2006]: info: call_remote_stonith: No remaining peers capable of terminating ctest1
Oct 14 15:54:47 ctest0 stonith-ng: [2006]: ERROR: remote_op_done: Already sent notifications for 'off of ctest1 by (null)' (op=6ca32814-1272-482f-bb67-f0b46daef78b, for=d49c9501-bcd3-4563-87a0-303f1b6d4c22, state=1): Operation timed out
Oct 14 15:54:48 ctest0 stonith-ng: [2006]: info: call_remote_stonith: Requesting that ctest2 perform op off ctest1
Oct 14 15:54:48 ctest0 stonith-ng: [2006]: info: call_remote_stonith: No remaining peers capable of terminating ctest1
Oct 14 15:54:48 ctest0 stonith-ng: [2006]: ERROR: remote_op_done: Already sent notifications for 'off of ctest1 by (null)' (op=6ca32814-1272-482f-bb67-f0b46daef78b, for=d49c9501-bcd3-4563-87a0-303f1b6d4c22, state=1): Operation timed out

I'll look into why that is, but the result is that there are 20 'at' jobs in the queue, and every time the machine starts up it shuts itself down again. That's easy enough to fix, but it probably shouldn't happen (even with external/ssh, which is advertised as 'not for production').

Is there a way to schedule an at job in such a way that it cancels a previously scheduled job? I can't see one...

James
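Something along these lines is what I'm after (just an untested sketch; the job-file path and the scheduled command here are invented for illustration, not what external/ssh actually runs):

    # Clear out the stale jobs already sitting in the queue:
    # atq prints one pending job per line, job number first; atrm removes it.
    for job in $(atq | awk '{print $1}'); do atrm "$job"; done

    # What I'd like the agent to be able to do: remember the job it queued
    # last time and remove it before queuing a new one.
    JOBFILE=/var/run/stonith-ssh.atjob   # invented path for this sketch
    [ -f "$JOBFILE" ] && atrm "$(cat "$JOBFILE")" 2>/dev/null
    echo "/sbin/shutdown -h now" | at now + 1 minute 2>&1 \
        | sed -n 's/^job \([0-9]*\).*/\1/p' > "$JOBFILE"

i.e. since at has no notion of named or replaceable jobs, the only approach I can see is keeping the job number around and atrm'ing it yourself.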