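By the looks of it, you don't have real fencing configured: your only stonith resource, STONITH_A, uses fence_dummy, which is a test agent and doesn't actually fence anything. Without real, working fencing, recovery can be unpredictable. Can you set that up and see if the problem goes away?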
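
For the DRBD side, here is a minimal sketch of what that could look like, assuming DRBD 8.4 and the handler scripts at the paths already used in disk0.res. The only change from the posted config is the fencing policy, which is a suggestion and has not been tested on this cluster. A stonith agent that matches the actual hardware would still have to replace fence_dummy in Pacemaker; that part is not shown because it depends on the fence device.

```
# Fragment of drbd.d/disk0.res; every other section stays as posted.
resource disk0 {
         disk {
                 on-io-error detach;
                 # Assumption: resource-and-stonith instead of resource-only,
                 # so DRBD waits for the fence-peer handler before resuming I/O.
                 fencing resource-and-stonith;
         }
         handlers {
                 # Same handler as in the posted config.
                 fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
         }
}
```

With resource-and-stonith, DRBD suspends I/O on the resource while the fence-peer handler runs, rather than carrying on as it does with resource-only.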

digimer

On 23/09/14 09:59 AM, Carsten Otto wrote:
On Tue, Sep 23, 2014 at 09:50:12AM -0400, Digimer wrote:
Can you share your pacemaker and drbd configurations please?

drbd.d/global_common.conf:
global {
   usage-count no;
}

common {
   protocol C;
   handlers {
     split-brain "/usr/lib/drbd/notify-split-brain.sh root";
     out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
   }
}

drbd.d/disk0.res:
resource disk0 {
         syncer {
                 rate 10M;
                 csums-alg sha1;
         }
         disk {
                 on-io-error detach;
                 fencing resource-only;
         }
         handlers {
                 before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh";
                 after-resync-target "/usr/lib/drbd/unsnapshot-resync-target-lvm.sh; /usr/lib/drbd/crm-unfence-peer.sh";
                 fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                 split-brain "/usr/lib/drbd/notify-split-brain.sh root";
         }
         net {
                 after-sb-0pri discard-younger-primary;
                 after-sb-1pri discard-secondary;
                 after-sb-2pri call-pri-lost-after-sb;
         }
         device    /dev/drbd0;
         disk      /dev/centos/drbd-lv;
         meta-disk internal;
         on node_a {
                 address   192.168.69.89:7789;
         }
         on node_b {
                 address   192.168.69.90:7789;
         }
}

pcs resource --full:
  Master: DRBD_MASTER
   Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true failure-timeout=60sec
   Resource: DRBD (class=ocf provider=linbit type=drbd)
    Attributes: drbd_resource=disk0
    Meta Attrs: failure-timeout=60sec
    Operations: start interval=0s timeout=240 (DRBD-start-timeout-240)
                promote interval=0s timeout=90 (DRBD-promote-timeout-90)
                demote interval=0s timeout=90 (DRBD-demote-timeout-90)
                stop interval=0s timeout=100 (DRBD-stop-timeout-100)
                monitor interval=9 role=Master (DRBD-monitor-interval-9)
                monitor interval=11 role=Slave (DRBD-monitor-interval-11)
  Group: GROUP
   Resource: VIP (class=ocf provider=heartbeat type=IPaddr2)
    Attributes: ip=192.168.69.48 cidr_netmask=32
    Meta Attrs: failure-timeout=60sec
    Operations: start interval=0s timeout=20s (VIP-start-timeout-20s)
                stop interval=0s timeout=5s (VIP-stop-timeout-5s)
                monitor interval=10sec (VIP-monitor-interval-10sec)
   Resource: FS (class=ocf provider=heartbeat type=Filesystem)
    Attributes: device=/dev/drbd0 directory=/mnt/drbd options=noatime,nodiratime fstype=ext4
    Meta Attrs: failure-timeout=60sec
    Operations: start interval=0s timeout=60 (FS-start-timeout-60)
                stop interval=0s timeout=10s (FS-stop-timeout-10s)
                monitor interval=5sec (FS-monitor-interval-5sec)
   Resource: PGSQL (class=ocf provider=heartbeat type=pgsql)
    Meta Attrs: failure-timeout=60sec
    Operations: start interval=0s timeout=120 (PGSQL-start-timeout-120)
                stop interval=0s timeout=120 (PGSQL-stop-timeout-120)
                promote interval=0s timeout=120 (PGSQL-promote-timeout-120)
                demote interval=0s timeout=120 (PGSQL-demote-timeout-120)
                monitor interval=10sec (PGSQL-monitor-interval-10sec)
   Resource: ASTERISK (class=ocf provider=heartbeat type=asterisk)
    Meta Attrs: failure-timeout=60sec
    Operations: start interval=0s timeout=20 (ASTERISK-start-timeout-20)
                monitor interval=10sec (ASTERISK-monitor-interval-10sec)
                stop interval=0s timeout=1 (ASTERISK-stop-timeout-1)
   Resource: TOMCAT (class=ocf provider=heartbeat type=tomcat)
    Attributes: java_home=/usr/java/latest/ catalina_home=/usr/share/tomcat statusurl=http://localhost:8080/xxx/
    Meta Attrs: failure-timeout=60sec
    Operations: start interval=0s timeout=60s (TOMCAT-start-timeout-60s)
                stop interval=0s timeout=20s (TOMCAT-stop-timeout-20s)
                monitor interval=10sec (TOMCAT-monitor-interval-10sec)

pcs constraint --full:
Location Constraints:
   Resource: DRBD_MASTER
     Constraint: drbd-fence-by-handler-disk0-DRBD_MASTER
       Rule: score=-INFINITY role=Master (id:drbd-fence-by-handler-disk0-rule-DRBD_MASTER)
         Expression: #uname ne node_a (id:drbd-fence-by-handler-disk0-expr-DRBD_MASTER)
   Resource: STONITH_A
     Disabled on: node_b (score:-INFINITY) (id:location-STONITH_A-node_b--INFINITY)
Ordering Constraints:
   promote DRBD_MASTER then start GROUP (Mandatory) (id:order-DRBD_MASTER-GROUP-mandatory)
Colocation Constraints:
   GROUP with DRBD_MASTER (INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-GROUP-DRBD_MASTER-INFINITY)

pcs stonith --full:
  STONITH_A  (stonith:fence_dummy):  Started
  Resource: STONITH_A (class=stonith type=fence_dummy)
   Attributes: passwd=x pcmk_host_list=node_b
   Operations: monitor interval=60s (STONITH_A-monitor-interval-60s)

[Note: The problem also happens without stonith and with a proper stonith
configuration on both nodes!]

pcs property:
Cluster Properties:
  cluster-infrastructure: corosync
  cluster-recheck-interval: 5min
  dc-version: 1.1.10-32.el7_0-368c726
  last-lrm-refresh: 1411475550
  no-quorum-policy: ignore
  stonith-enabled: true



--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
