On 21.12.2011 13:07, Tim Serong wrote:
> My guess would be:
>
> The filesystem can't stop on the non-quorate node, because the network
> connection is down, so DLM can't do its thing.

Ok.

> The filesystem is probably frozen on the quorate node, because of loss
> of DLM comms.

Ok, same problem as above then.

> If STONITH is configured, the non-quorate node should be killed after a
> failed (or timed out) stop, and the quorate node should resume behaving
> normally.
>
> HTH,
>
> Tim

But the loss of DLM communication makes *both* nodes hang: the one Pacemaker is shutting down because it lost quorum, and the one in the partition with quorum, which should stay alive.

My point is that at least one OCFS2 node (the one in the partition with quorum) should survive the loss of communication and stay healthy. Instead, DLM (or something else) gets "stuck" and both nodes hang. That's the problem.
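
To make the sequence under discussion concrete, here is a toy Python model of it. It is only a sketch of the behaviour described in this thread, not Pacemaker, DLM or OCFS2 code: the `Node` class, the `partition` function and the state names are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    has_quorum: bool
    state: str = "running"  # running | frozen | stop-failed | fenced


def partition(nodes, stonith_enabled):
    """Toy model: DLM comms are lost, then fencing may or may not happen."""
    # Step 1: DLM comms are lost, so the filesystem freezes on every node.
    for n in nodes:
        n.state = "frozen"

    # Step 2: the non-quorate node is told to stop, but stopping the
    # filesystem needs DLM, so the stop fails or times out.
    for n in nodes:
        if not n.has_quorum:
            n.state = "stop-failed"

    # Step 3: with STONITH, the node whose stop failed is killed, and the
    # quorate node can resume once that node is known to be dead.
    if stonith_enabled:
        for n in nodes:
            if n.state == "stop-failed":
                n.state = "fenced"
        for n in nodes:
            if n.has_quorum:
                n.state = "running"

    return {n.name: n.state for n in nodes}


if __name__ == "__main__":
    print(partition([Node("a", True), Node("b", False)], stonith_enabled=True))
    print(partition([Node("a", True), Node("b", False)], stonith_enabled=False))
```

In this model, without the fencing step nothing ever moves either node out of its stuck state, which matches the "both nodes hang" symptom.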


Ivan

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
