Hello,

I run Corosync + Pacemaker + DRBD in a two-node cluster, where all resources are part of a group colocated with DRBD (DRBD + virtual IP + filesystem + ...). To test my configuration, I currently have two nodes with only a single disk drive each. This drive is the only LVM physical volume in an LVM volume group; the Linux system resides on some of its logical volumes, and the disk exported by DRBD is another logical volume.
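To make the setup concrete, it was created roughly like this with pcs (a sketch only; all resource names, device paths and the IP address below are placeholders, not my actual values):

```shell
# DRBD resource, managed as master/slave (one Primary, one Secondary):
pcs resource create drbd_disk0 ocf:linbit:drbd drbd_resource=disk0 \
    op monitor interval=29s role=Master op monitor interval=31s role=Slave
pcs resource master DRBDMaster drbd_disk0 master-max=1 master-node-max=1 \
    clone-max=2 clone-node-max=1 notify=true

# Filesystem and virtual IP, grouped so they always move together:
pcs resource create fs_disk0 ocf:heartbeat:Filesystem device=/dev/drbd0 \
    directory=/srv/disk0 fstype=ext4
pcs resource create vip ocf:heartbeat:IPaddr2 ip=192.168.122.100 cidr_netmask=24
pcs resource group add services fs_disk0 vip

# The group runs where DRBD is Primary, and only after promotion:
pcs constraint colocation add services with DRBDMaster INFINITY with-rsc-role=Master
pcs constraint order promote DRBDMaster then start services
```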
When I now cut power to the disk drive on the node running the resources (the DRBD Primary), DRBD notices this and goes "Diskless". My services also stop working, which is understandable without a functioning disk drive. However, in my experiments one of the following two problems occurs:

1) The services are stopped and DRBD is demoted (according to "pcs status" and pacemaker.log); according to /proc/drbd on the surviving node, however, the diskless node is still Primary. As a consequence, I see failing attempts to promote on the survivor node:

   drbd(DRBD)[1797]: 2014/09/23_14:35:56 ERROR: disk0: Called drbdadm -c /etc/drbd.conf primary disk0
   drbd(DRBD)[1797]: 2014/09/23_14:35:56 ERROR: disk0: Exit code 11

The problem here seems to be:

   crmd: info: match_graph_event: Action DRBD_demote_0 (12) confirmed on diskless_node (rc=0)

This demote operation obviously should not have been confirmed. I also find it hard to believe that the stop operations of the ordinary resources can really succeed without access to the resource agent scripts (which are on the failed disk) and the tools they use.

2) My services no longer work, but nothing happens in the cluster. Everything looks as it did before the failure, with the only difference that /proc/drbd shows "Diskless" and some "oos". It seems Corosync/Pacemaker reports "all is well" to the DC while, due to the missing disk, nothing actually works.

I guess that running any of the monitor scripts is problematic without access to the underlying files, so I would like to see some sort of failure communicated from the diskless node to the surviving node (or have the surviving node come to the same conclusion via a timeout).

Is this buggy behaviour? How should a node behave if all of its disks stop working? I can reproduce this. If you need details about the configuration or more output from pacemaker.log, please just tell me.
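For completeness, triggering and observing the failure boils down to something like this (the device name sdb is a placeholder for the disk backing the LVM physical volume; in my tests I physically cut power to the drive instead of deleting it via sysfs):

```shell
# Simulate the drive disappearing on the node running the resources:
echo 1 > /sys/block/sdb/device/delete

# On both nodes: DRBD state -- the failed node shows ds:Diskless,
# and the peer accumulates "oos" (out-of-sync) sectors:
cat /proc/drbd

# What Pacemaker believes about the resources:
pcs status
crm_mon -1
```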
The versions reported by CentOS 7:

   corosync  2.3.3-2.el7
   pacemaker 1.1.10-32.el7_0
   drbd      8.4.5-1.el7.elrepo

Thank you,
Carsten

--
andrena objects ag
Büro Frankfurt, Clemensstr. 8, 60487 Frankfurt
http://www.andrena.de
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org