I'll give the technical details in a moment, but I thought I'd start with a
description of the problem.

I have a two-node active/passive cluster, with DRBD controlled by Pacemaker. I
upgraded to DRBD 8.4.x about six months ago (probably too soon); everything was
fine. Then last week we did some power-outage tests on our cluster.

Each node in the cluster is attached to its own uninterruptible power supply;
the STONITH mechanism is to turn off the other node's UPS. In the event of an
extended power outage (this happens 2-3 times a year at my site), it's likely
that one node will STONITH the other when the other node's UPS runs out of power
and shuts it down. This means that when power comes back on, only one node will
come back up, since the STONITHed UPS won't turn on again without manual
intervention.

The problem is that with only one node, Pacemaker+DRBD won't promote the DRBD
resource to primary; it just sits there at secondary and won't start up any
DRBD-dependent resources. Only when the second node comes back up will Pacemaker
assign one of them the primary role. I've confirmed this by shutting down
corosync on both nodes, then bringing it up again on just one of them.
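
For anyone who wants to check the same symptom, here is a minimal sketch of
reading the local role out of /proc/drbd-style output. The helper name
drbd_local_role and the sample status line are made up for illustration (the
line was not captured from the cluster); the "ro:local/peer" field layout is
assumed from DRBD 8.x status output.

```shell
#!/bin/sh
# Print the local role (e.g. Primary or Secondary) for one DRBD minor,
# reading /proc/drbd-style text on stdin.
# Assumption: each device line looks like " <minor>: cs:... ro:<local>/<peer> ds:..."
drbd_local_role() {
    minor="$1"
    # Keep only the text between "ro:" and the following "/" on the
    # line that starts with the requested minor number.
    sed -n "s|^ *${minor}: .*ro:\([A-Za-z]*\)/.*|\1|p"
}

# Illustrative sample line (hypothetical, not real output from hypatia/orestes).
sample=' 1: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown A r-----'
printf '%s\n' "$sample" | drbd_local_role 1
```

On a live node the same function would be fed from the kernel's status file,
as in "drbd_local_role 1 < /proc/drbd", after restarting corosync on that one
node only.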

I'm pretty sure that this is due to a mistake I made in my DRBD configuration
when I fiddled with it during the 8.4.x upgrade. I've attached the files. Can
one of you kind folks spot the error?

Technical details:

Two-node configuration: hypatia and orestes
OS: Scientific Linux 5.5, kernel 2.6.18-238.19.1.el5xen
Packages:
drbd-8.4.1-1
corosync-1.2.7-1.1.el5
pacemaker-1.0.12-1.el5.centos
openais-1.1.3-1.6.el5

Attached: global_common.conf, nevis-admin.res

-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://[email protected]
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/

global_common.conf:

global {
        usage-count yes;
}

common {
        handlers {
                pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        }

        startup {
        }

        options {
        }

        disk {
                resync-rate 15M;
                al-extents 257;
        }

        net {
                protocol A;
                ping-timeout 11;
        }
}

nevis-admin.res:

resource admin {

  on hypatia.nevis.columbia.edu {
      volume 0 {
          device       /dev/drbd1;
          disk         /dev/md2;
          meta-disk    internal;
      }
      address          ipv4 192.168.100.7:7789;
  }
  on orestes.nevis.columbia.edu {
      volume 0 {
          device       /dev/drbd1;
          disk         /dev/md2;
          meta-disk    internal;
      }
      address          ipv4 192.168.100.6:7789;
  }

  net {
    after-sb-0pri discard-least-changes;
    after-sb-1pri consensus;
    after-sb-2pri disconnect;
  }

  startup {
    wfc-timeout 60;
    degr-wfc-timeout 60;
    outdated-wfc-timeout 60;
  }

  disk {
    fencing resource-only;
  }

  handlers {
    split-brain "/usr/lib/drbd/notify-split-brain.sh [email protected]";
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
