----- Original Message -----
> From: "Jürgen Herrmann" <juergen.herrm...@xlhost.de>
> To: "Jake Smith" <jsm...@argotec.com>, "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
> Sent: Tuesday, February 5, 2013 4:00:48 PM
> Subject: Re: [Pacemaker] Dual primary drbd resource not promoted on one host
>
> On 05.02.2013 16:32, Jake Smith wrote:
> > ----- Original Message -----
> >> From: "Jürgen Herrmann" <juergen.herrm...@xlhost.de>
> >> To: pacemaker@oss.clusterlabs.org
> >> Sent: Tuesday, February 5, 2013 7:04:26 AM
> >> Subject: [Pacemaker] Dual primary drbd resource not promoted on one host
> >>
> >> Hi there!
> >>
> >> I have the following problem:
> >>
> >> I have a 2 node cluster with a dual primary drbd resource. On top
> >> of it sits an OCFS2 file system. nodes: app1a, app1b
> >>
> >> Now today I had the following scenario (occurred several times now):
> >> - crm node standby app1a
> >> - poweroff app1a for hdd replacement (hw raid controller)
> >> - poweron app1a
> >> - crm node online app1a
> >>
> >> all the other resources come back up as expected, except the
> >> master/slave set for the dual primary drbd.
> >>
> >> here's the relevant portion of my cluster config:
> >>
> >> node app1a.xlhost.de \
> >>         attributes standby="off"
> >> node app1b.xlhost.de \
> >>         attributes standby="off"
> >> primitive resDLM ocf:pacemaker:controld \
> >>         op start interval="0" timeout="90s" \
> >>         op stop interval="0" timeout="100s" \
> >>         op monitor interval="120s"
> >> primitive resDRBD0 ocf:linbit:drbd \
> >>         op monitor interval="23" role="Slave" timeout="30" \
> >>         op monitor interval="13" role="Master" timeout="20" \
> >>         op start interval="0" timeout="240s" \
> >>         op promote interval="0" timeout="240s" \
> >>         op demote interval="0" timeout="100s" \
> >>         op stop interval="0" timeout="100s" \
> >>         params drbd_resource="drbd0"
> >> primitive resFSDRBD0 ocf:heartbeat:Filesystem \
> >>         params device="/dev/drbd0" directory="/mnt/drbd0" fstype="ocfs2" options="noatime,intr,nodiratime,heartbeat=none" \
> >>         op monitor interval="120s" timeout="50s" \
> >>         op start interval="0" timeout="70s" \
> >>         op stop interval="0" timeout="70s"
> >> primitive resO2CB ocf:pacemaker:o2cb \
> >>         op start interval="0" timeout="90s" \
> >>         op stop interval="0" timeout="100s" \
> >>         op monitor interval="120s"
> >> ms msDRBD0 resDRBD0 \
> >>         meta master-max="2" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Master"
> >> clone cloneDLM resDLM \
> >>         meta globally-unique="false" interleave="true" target-role="Started"
> >> clone cloneFSDRBD0 resFSDRBD0 \
> >>         meta interleave="true" globally-unique="false" target-role="Started"
> >> clone cloneO2CB resO2CB \
> >>         meta globally-unique="false" interleave="true" target-role="Started"
> >> colocation colFSDRBD0_DRBD0 inf: cloneFSDRBD0 msDRBD0:Master
> >
> > ^^^ This colocation should be cloneDLM on msDRBD0.
> >
> >> colocation colFSDRBD0_O2CB inf: cloneFSDRBD0 cloneO2CB
> >> colocation colO2CB_DLM inf: cloneO2CB cloneDLM
> >> order ordDLM_FSDRBD0 inf: cloneDLM cloneFSDRBD0
> >
> > ^^^ This order statement is not needed.
> >
> >> order ordDLM_O2CB inf: cloneDLM cloneO2CB
> >> order ordDRBD0_FSDRBD0 inf: msDRBD0:promote cloneFSDRBD0
> >
> > ^^^ This order should be msDRBD0:promote then cloneDLM:start
> >
> > If you explicitly define the action for the first resource in an
> > order statement, the same action is implied for the remaining
> > resources. So your statement is going to try to promote
> > cloneFSDRBD0.
> > You should define both actions explicitly like this:
> >
> > order ordDRBD0_DLM inf: msDRBD0:promote cloneDLM:start
> >
> >> order ordO2CB_FSDRBD0 inf: cloneO2CB cloneFSDRBD0
> >>
> >
> >
> >> If I take down both nodes and fire them up again, everything goes
> >> back to normal and msDRBD0 is promoted to master on both nodes.
> >>
> >> I suspect this has something to do with ordering or colocation
> >> constraints, but I'm not sure. I've stared at this problem dozens
> >> of times now, and a vast amount of googling did not turn up my
> >> specific problem either.
> >
> > I'm pretty sure you are correct. I haven't used/tested OCFS on
> > Pacemaker in a while, but I believe this is the correct
> > ordering/colocation you're looking for (same as my notes above):
> >
> > Order - DRBD:promote then DLM:start then O2CB:start then FS:start
> > Colocation - FS on O2CB on DLM on DRBD:master
> >
>
> Hi Jake!
>
> Thanks very much for your comments!
>
> To sum it up, I rewrote all six order/colo statements here:
>
> colocation colDLM_DRBD0 inf: cloneDLM msDRBD0:Master
> colocation colO2CB_DLM inf: cloneO2CB cloneDLM
> colocation colFSDRBD0_O2CB inf: cloneFSDRBD0 cloneO2CB
>
> order ordDRBD0_DLM inf: msDRBD0:promote cloneDLM:start
> order ordDLM_O2CB inf: cloneDLM cloneO2CB
> order ordO2CB_FSDRBD0 inf: cloneO2CB cloneFSDRBD0
>
> I will try this sometime in the upcoming nights and will report back.
> Maybe in the meantime you could have a look at the statements again
> to double-check? Thanks in advance.
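
For anyone who wants to double-check a chain like this offline, here is a minimal Python sketch (not a crm shell parser, and not part of Pacemaker). It assumes Python 3.9+ for `graphlib`, handles only the simple two-resource `order`/`colocation` form used in the six statements above, and the helper names `parse` and `start_sequence` are made up for illustration. It derives the sequence implied by the order statements and checks that every colocation has a matching order in the opposite direction.

```python
from graphlib import TopologicalSorter

# The six rewritten statements from the message above (crm shell syntax).
CONSTRAINTS = """\
colocation colDLM_DRBD0 inf: cloneDLM msDRBD0:Master
colocation colO2CB_DLM inf: cloneO2CB cloneDLM
colocation colFSDRBD0_O2CB inf: cloneFSDRBD0 cloneO2CB
order ordDRBD0_DLM inf: msDRBD0:promote cloneDLM:start
order ordDLM_O2CB inf: cloneDLM cloneO2CB
order ordO2CB_FSDRBD0 inf: cloneO2CB cloneFSDRBD0
"""


def parse(text):
    """Return (orders, colocations) as lists of (left, right) resource names.

    Hypothetical helper: only understands 'kind id score: left right'.
    """
    orders, colos = [], []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip anything that is not a simple two-resource constraint
        kind, _ident, _score, left, right = fields
        # Drop the ':promote' / ':start' / ':Master' suffix, keep the resource name.
        left, right = left.split(":")[0], right.split(":")[0]
        if kind == "order":
            orders.append((left, right))  # left acts first, then right
        elif kind == "colocation":
            colos.append((left, right))   # left is placed with right
    return orders, colos


def start_sequence(orders):
    """Topologically sort resources; raises graphlib.CycleError on an ordering loop."""
    sorter = TopologicalSorter()
    for first, then in orders:
        sorter.add(then, first)  # 'then' depends on 'first'
    return list(sorter.static_order())


orders, colos = parse(CONSTRAINTS)
sequence = start_sequence(orders)
# Every colocation "A with B" should be paired with an order "B then A".
unmatched = [(a, b) for a, b in colos if (b, a) not in orders]

print(" -> ".join(sequence))
print("colocations without a matching order:", unmatched)
```

This only checks that the statements are consistent with each other. It says nothing about how Pacemaker will actually schedule the resources, so the real test is still the standby/online cycle on app1a.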

Looks good to me (assuming I recall correctly that dlm needs to start before o2cb).

>
> best regards,
> Jürgen Herrmann
>
> --
> >> XLhost.de ® - Web hosting from supersmall to eXtra Large <<
>
> XLhost.de GmbH
> Jürgen Herrmann, Managing Director
> Boelckestrasse 21, 93051 Regensburg, Germany
>
> Managing Director: Jürgen Herrmann
> Registered under: HRB9918
> VAT identification number: DE245931218
>
> Phone: +49 (0)800 XLHOSTDE [0800 95467833]
> Fax: +49 (0)800 95467830
> Web: http://www.XLhost.de