Hi,
I am trying to build an Active/Passive OpenNMS installation but have problems with failover and failback. Doing everything manually (DRBD, VIP, Filesystem, ...) works fine. I have read many tutorials, the official book and the FAQ, but I am still stuck.

I use Ubuntu 10.04 with "pacemaker" and "corosync". The hosts are two HP ProLiant G6 servers connected through a Cisco switch. The switch was configured to allow multicast as described in [1].

The hostnames are "monitoring-node-01" and "monitoring-node-02". It seems I can fail over to "monitoring-node-02" but not back. The DRBD init script is disabled on both nodes.

First my current config:

crm configure show

node monitoring-node-01 \
        attributes standby="off"
node monitoring-node-02 \
        attributes standby="off"
primitive drbd-opennms-config ocf:linbit:drbd \
        params drbd_resource="config" \
        op start interval="0" timeout="300s" \
        op stop interval="0" timeout="300s"
primitive drbd-opennms-data ocf:linbit:drbd \
        params drbd_resource="data" \
        op start interval="0" timeout="300s" \
        op stop interval="0" timeout="300s"
primitive drbd-opennms-db ocf:linbit:drbd \
        params drbd_resource="db" \
        op start interval="0" timeout="300s" \
        op stop interval="0" timeout="300s"
primitive fs-opennms-config ocf:heartbeat:Filesystem \
        params device="/dev/drbd/by-res/config" directory="/etc/opennms" fstype="xfs" \
        op stop interval="0" timeout="300s" \
        op start interval="0" timeout="300s"
primitive fs-opennms-data ocf:heartbeat:Filesystem \
        params device="/dev/drbd/by-res/data" directory="/var/lib/opennms" fstype="xfs" \
        op stop interval="0" timeout="300s" \
        op start interval="0" timeout="300s"
primitive fs-opennms-db ocf:heartbeat:Filesystem \
        params device="/dev/drbd/by-res/db" directory="/var/lib/postgresql" fstype="xfs" \
        op stop interval="0" timeout="300s" \
        op start interval="0" timeout="300s"
primitive opennms lsb:opennms \
        op start interval="0" timeout="300s" \
        op stop interval="0" timeout="300s" \
        meta target-role="Started"
primitive postgres lsb:postgresql-8.4
primitive vip ocf:heartbeat:IPaddr2 \
        params ip="172.24.25.20" cidr_netmask="24" nic="bond0"
group dependencies vip fs-opennms-config fs-opennms-db fs-opennms-data postgres opennms
ms ms-opennms-config drbd-opennms-config \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
ms ms-opennms-data drbd-opennms-data \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
ms ms-opennms-db drbd-opennms-db \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
order promote-before-drbd-config inf: ms-opennms-config:promote fs-opennms-config:start
order promote-before-drbd-data inf: ms-opennms-data:promote fs-opennms-data:start
order promote-before-drbd-db inf: ms-opennms-db:promote fs-opennms-db:start
property $id="cib-bootstrap-options" \
        dc-version="1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1278329430"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100"

Current state when everything works:

monitoring-node-01

cat /proc/drbd

version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by r...@monitoring-node-01, 2010-06-22 20:00:41
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:0 nr:0 dw:2052 dr:29542 al:2 bm:8 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:0 nr:0 dw:0 dr:200 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r----
    ns:0 nr:0 dw:0 dr:200 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

df -h

Filesystem         Size  Used Avail Use% Mounted on
/dev/cciss/c0d0p1  270G  3.1G  253G   2% /
none                16G  212K   16G   1% /dev
none                16G  3.5M   16G   1% /dev/shm
none                16G   84K   16G   1% /var/run
none                16G   12K   16G   1% /var/lock
none                16G     0   16G   0% /lib/init/rw
/dev/drbd0          10G   34M   10G   1% /etc/opennms
/dev/drbd1          50G   67M   50G   1% /var/lib/postgresql
/dev/drbd2         214G  178M  214G   1% /var/lib/opennms

crm_mon

============
Last updated: Mon Jul 5 13:02:54 2010
Stack: openais
Current DC: monitoring-node-02 - partition with quorum
Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
2 Nodes configured, 2 expected votes
4 Resources configured.
============

Online: [ monitoring-node-01 monitoring-node-02 ]

 Master/Slave Set: ms-opennms-data
     Masters: [ monitoring-node-01 ]
     Slaves: [ monitoring-node-02 ]
 Master/Slave Set: ms-opennms-db
     Masters: [ monitoring-node-01 ]
     Slaves: [ monitoring-node-02 ]
 Master/Slave Set: ms-opennms-config
     Masters: [ monitoring-node-01 ]
     Slaves: [ monitoring-node-02 ]
 Resource Group: dependencies
     vip (ocf::heartbeat:IPaddr2): Started monitoring-node-01
     fs-opennms-config (ocf::heartbeat:Filesystem): Started monitoring-node-01
     fs-opennms-db (ocf::heartbeat:Filesystem): Started monitoring-node-01
     fs-opennms-data (ocf::heartbeat:Filesystem): Started monitoring-node-01
     postgres (lsb:postgresql-8.4): Started monitoring-node-01
     opennms (lsb:opennms): Started monitoring-node-01

monitoring-node-02

cat /proc/drbd

version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by r...@monitoring-node-02, 2010-06-22 19:59:35
 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r----
    ns:0 nr:0 dw:24620 dr:400 al:0 bm:4 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r----
    ns:0 nr:302898 dw:302898 dr:400 al:0 bm:6 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
 2: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r----
    ns:0 nr:19712467 dw:19712467 dr:400 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0

df -h

Filesystem         Size  Used Avail Use% Mounted on
/dev/cciss/c0d0p1  270G  1.8G  254G   1% /
none                16G  212K   16G   1% /dev
none                16G  3.6M   16G   1% /dev/shm
none                16G   72K   16G   1% /var/run
none                16G   12K   16G   1% /var/lock
none                16G     0   16G   0% /lib/init/rw

Rebooting "monitoring-node-01" seems to work. All services then run on "monitoring-node-02" and were started in the correct order. But when I reboot "monitoring-node-02":

crm_mon

Last updated: Mon Jul 5 13:21:31 2010
Stack: openais
Current DC: monitoring-node-01 - partition with quorum
Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
2 Nodes configured, 2 expected votes
4 Resources configured.
============

Online: [ monitoring-node-01 monitoring-node-02 ]

 Master/Slave Set: ms-opennms-data
     Masters: [ monitoring-node-01 ]
     Slaves: [ monitoring-node-02 ]
 Master/Slave Set: ms-opennms-db
     Masters: [ monitoring-node-01 ]
     Slaves: [ monitoring-node-02 ]
 Master/Slave Set: ms-opennms-config
     Masters: [ monitoring-node-01 ]
     Slaves: [ monitoring-node-02 ]
 Resource Group: dependencies
     vip (ocf::heartbeat:IPaddr2): Started monitoring-node-02
     fs-opennms-config (ocf::heartbeat:Filesystem): Stopped
     fs-opennms-db (ocf::heartbeat:Filesystem): Stopped
     fs-opennms-data (ocf::heartbeat:Filesystem): Stopped
     postgres (lsb:postgresql-8.4): Stopped
     opennms (lsb:opennms): Stopped

Failed actions:
    fs-opennms-config_start_0 (node=monitoring-node-02, call=18, rc=1, status=complete): unknown error
    fs-opennms-config_start_0 (node=m)

When I then run the following command:

crm resource cleanup fs-opennms-config

the failover goes through and the services are started on monitoring-node-01:

crm_mon

============
Last updated: Mon Jul 5 13:31:31 2010
Stack: openais
Current DC: monitoring-node-01 - partition with quorum
Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
2 Nodes configured, 2 expected votes
4 Resources configured.
============

Online: [ monitoring-node-01 monitoring-node-02 ]

 Master/Slave Set: ms-opennms-data
     Masters: [ monitoring-node-01 ]
     Slaves: [ monitoring-node-02 ]
 Master/Slave Set: ms-opennms-db
     Masters: [ monitoring-node-01 ]
     Slaves: [ monitoring-node-02 ]
 Master/Slave Set: ms-opennms-config
     Masters: [ monitoring-node-01 ]
     Slaves: [ monitoring-node-02 ]
 Resource Group: dependencies
     vip (ocf::heartbeat:IPaddr2): Started monitoring-node-01
     fs-opennms-config (ocf::heartbeat:Filesystem): Started monitoring-node-01
     fs-opennms-db (ocf::heartbeat:Filesystem): Started monitoring-node-01
     fs-opennms-data (ocf::heartbeat:Filesystem): Started monitoring-node-01
     postgres (lsb:postgresql-8.4): Started monitoring-node-01
     opennms (lsb:opennms): Started monitoring-node-01

There are some entries in my daemon.log [2] which look as if they have something to do with my problem...

monitoring-node-01 lrmd: [994]: info: rsc:drbd-opennms-data:1:30: promote
monitoring-node-01 crmd: [998]: info: do_lrm_rsc_op: Performing key=93:25:0:d49b62be-1e33-48ca-a8c3-cb128676d444 op=fs-opennms-config_start_0 )
monitoring-node-01 lrmd: [994]: info: rsc:fs-opennms-config:31: start
monitoring-node-01 Filesystem[2464]: INFO: Running start for /dev/drbd/by-res/config on /etc/opennms
monitoring-node-01 lrmd: [994]: info: RA output: (fs-opennms-config:start:stderr) FATAL: Module scsi_hostadapter not found.
monitoring-node-01 lrmd: [994]: info: RA output: (drbd-opennms-data:1:promote:stdout)
monitoring-node-01 crmd: [998]: info: process_lrm_event: LRM operation drbd-opennms-data:1_promote_0 (call=30, rc=0, cib-update=35, confirmed=true) ok
monitoring-node-01 lrmd: [994]: info: RA output: (fs-opennms-config:start:stderr) /dev/drbd/by-res/config: Wrong medium type
monitoring-node-01 lrmd: [994]: info: RA output: (fs-opennms-config:start:stderr) mount: block device /dev/drbd0 is write-protected, mounting read-only
monitoring-node-01 lrmd: [994]: info: RA output: (fs-opennms-config:start:stderr) mount: Wrong medium type
monitoring-node-01 Filesystem[2464]: ERROR: Couldn't mount filesystem /dev/drbd/by-res/config on /etc/opennms
monitoring-node-01 crmd: [998]: info: process_lrm_event: LRM operation fs-opennms-config_start_0 (call=31, rc=1, cib-update=36, confirmed=true) unknown error

...but I don't know how to troubleshoot it.

[1] http://www.corosync.org/doku.php?id=faq:cisco_switches
[2] http://pastebin.com/DKLjXtx8

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
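A note on the log excerpt: "Wrong medium type" is typically what mount reports when a DRBD device is still in the Secondary role on the node doing the mount. One way to check this by hand is to read the local role of each device from /proc/drbd just before or after the failed start. The following is a minimal sketch, assuming the DRBD 8.3 status-line layout shown in this post. The names local_roles, not_primary and SAMPLE are hypothetical helpers made up for illustration, and the sample text is copied from the monitoring-node-02 output in the post.

```python
import re

# Matches a device status line in DRBD 8.3-style /proc/drbd output, e.g.
# " 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r----"
# (layout assumed from the output quoted in this post).
DEVICE_LINE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)\s+ro:(?P<local>\w+)/(?P<peer>\w+)"
)


def local_roles(proc_drbd_text):
    """Return {minor number: local role} parsed from /proc/drbd text."""
    roles = {}
    for line in proc_drbd_text.splitlines():
        match = DEVICE_LINE.match(line)
        if match:
            roles[int(match.group("minor"))] = match.group("local")
    return roles


def not_primary(proc_drbd_text):
    """Return the sorted minor numbers whose local role is not Primary."""
    roles = local_roles(proc_drbd_text)
    return sorted(minor for minor, role in roles.items() if role != "Primary")


# Status lines copied from the monitoring-node-02 output in the post.
SAMPLE = """version: 8.3.7 (api:88/proto:86-91)
 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r----
    ns:0 nr:0 dw:24620 dr:400 al:0 bm:4 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r----
    ns:0 nr:302898 dw:302898 dr:400 al:0 bm:6 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
 2: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r----
    ns:0 nr:19712467 dw:19712467 dr:400 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
"""

# On a real node one would pass the live file contents instead:
#   with open("/proc/drbd") as handle:
#       print(not_primary(handle.read()))
print(not_primary(SAMPLE))
```

If a device that backs one of the Filesystem resources shows up in this list on the node where the group is being started, the mount cannot succeed there until that device has been promoted.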