OK, I answered my own question below...for the most part.

On 05/18/2012 02:26 PM, Matthew O'Connor wrote:
> By the way, will Pacemaker or Corosync log something to the syslog if it
> decides to fence a member?  Will it attempt to fence one that has flat
> disappeared, or only one that it has become unable to stop services on?
> I ask because I have a node that recently started spitting out
> "rcu_sched_state detected stall on cpu..." whenever I'm not around.  The
> surviving node recognizes that it has lost contact with this defunct
> node, but by that point the DLM and/or OCFS2 is totally hosed and the
> surviving node requires a hard-restart.  I guess my hope is that, were
> fencing actually working on my cluster, the fence would happen before
> the surviving node's DLM/OCFS2 drivers melted down (assuming the real
> issue at hand isn't wiping out DLM/OCFS everywhere before the bad-node
> is determined offline by the good-node).

I understand now that the DLM expects STONITH to be working; otherwise it
will block forever, or until the failed node re-establishes contact. By
the way, my thanks go out to the writer of the libvirt-based STONITH
method. It worked great for me, and it was great to see it nuke my
misbehaving virtual test node! OCFS2 also responded much better in that
test environment. Fencing makes such a difference...

Thanks again for the info on cman+corosync+pacemaker!




_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
