On 09/09/2013, at 6:46 PM, Heikki Manninen <h...@iki.fi> wrote:

> Hello Andreas, thanks for your input, much appreciated.
>
> On 5.9.2013, at 16.39, "Andreas Mock" <andreas.m...@web.de> wrote:
>
>> 1) The second output of crm_mon shows a resource IP_database
>> which is not shown in the initial crm_mon output and also
>> not in the config. => Reduce your problem/config to the
>> minimum that reproduces the issue.
>
> True. I edited out that resource from the e-mail because it did not have anything to
> do with the problem as such (it works fine all the time). I just forgot to remove it
> from the second copy-paste as well. And yes, there is no IP resource in the
> configuration any more.
>
>> 2) Enable logging and find out which node is the DC.
>> In its logs you will find a lot of information showing
>> what is going on. Hint: open a terminal session with
>> "tail -f <logfile>" running and watch it while issuing commands.
>> You'll get used to it.
>
> Seems that node #2 was the DC (also visible in the pcs status output). I have
> looked at the logs all the time, I'm just not yet too familiar with the contents
> of pacemaker logging. Here's the thing that keeps repeating every time those
> LVM and FS resources stay in the stopped state:
>
> Sep 3 20:01:23 pgdbsrv02 pengine[1667]: notice: LogActions: Start LVM_vgdata01#011(pgdbsrv01.cl1.local - blocked)
> Sep 3 20:01:23 pgdbsrv02 pengine[1667]: notice: LogActions: Start FS_data01#011(pgdbsrv01.cl1.local - blocked)
> Sep 3 20:01:23 pgdbsrv02 pengine[1667]: notice: LogActions: Start LVM_vgdata02#011(pgdbsrv01.cl1.local - blocked)
> Sep 3 20:01:23 pgdbsrv02 pengine[1667]: notice: LogActions: Start FS_data02#011(pgdbsrv01.cl1.local - blocked)
>
> So what does "blocked" mean here?
It means we'd like to start those resources on pgdbsrv01.cl1.local but something has to happen first - possibly some other resource needs to start but can't.

> Is it that node #1 in this case is in need of fencing/stonithing and is thus
> being blocked, or something else? (I have a background in the
> RHCS/HACMP/LifeKeeper etc. world.) No-quorum-policy is set to ignore.
>
>> 3) The status of a drbd resource shown by crm_mon doesn't give
>> you all the information about the drbd devices. Have a look at
>> drbd-overview on both nodes (e.g. syncing status).
>
> True, DRBD is working fine on these occasions. Connected, synced etc.
>
>> 4) This setup CRIES for stonithing. Even in a test environment.
>> When stonith happens (this is what you see immediately) you
>> know something went wrong. This is a good indicator for
>> errors in agents or in the config. Believe me, as tedious as stonithing
>> is, it is just as valuable for getting hints about a bad cluster state.
>> On virtual machines stonithing is not as painful as on real
>> servers.
>
> Very much true. I have implemented some custom fencing/stonithing agents
> before on physical and virtual cluster environments. The problem here is
> that I'm not aware of a reasonably simple way to implement stonith with VMware
> Fusion, which I'm bound to use for this test setup. I'll have to dig more into this,
> though. So fencing from the cman cluster.conf is chained to pacemaker fencing,
> pacemaker stonith is disabled, and no-quorum-policy is ignore.
>
>> 5) Is the drbd fencing script enabled? If yes, in certain circumstances
>> -INF rules are inserted to deny promotion on the "wrong" nodes.
>> You should grep for them: 'cibadmin -Q | grep <resname>'
>
> No, DRBD fencing is not enabled and split-brain recovery is done manually.
>
>> 6) crm_simulate -L -v gives you an output of the scores of
>> the resources on each node. I really don't know how to read it
>> exactly (is there documentation for that anywhere?), but it
>> gives you a hint where to look when resources don't start.
>> Especially the aggregation of stickiness values in groups is
>> sometimes misleading.
>
> Could be that I have a different version, because -v is an unknown
> option and:
>
> # crm_simulate -L -V
>
> Current cluster status:
> Online: [ pgdbsrv01.cl1.local pgdbsrv02.cl1.local ]
>
>  Master/Slave Set: DRBD_ms_data01 [DRBD_data01]
>      Masters: [ pgdbsrv01.cl1.local ]
>      Slaves: [ pgdbsrv02.cl1.local ]
>  Master/Slave Set: DRBD_ms_data02 [DRBD_data02]
>      Masters: [ pgdbsrv01.cl1.local ]
>      Slaves: [ pgdbsrv02.cl1.local ]
>  Resource Group: GRP_data01
>      LVM_vgdata01 (ocf::heartbeat:LVM): Stopped
>      FS_data01 (ocf::heartbeat:Filesystem): Stopped
>  Resource Group: GRP_data02
>      LVM_vgdata02 (ocf::heartbeat:LVM): Stopped
>      FS_data02 (ocf::heartbeat:Filesystem): Stopped
>
> only shows that much.
>
> Original problem description left quoted below.
>
>
> --
> Heikki M
>
>
>> -----Original Message-----
>> From: Heikki Manninen [mailto:h...@iki.fi]
>> Sent: Thursday, 5 September 2013 14:08
>> To: pacemaker@oss.clusterlabs.org
>> Subject: [Pacemaker] Resource ordering/colocating question (DRBD + LVM + FS)
>>
>> Hello,
>>
>> I'm having a bit of a problem understanding what's going on with my simple
>> two-node demo cluster here. My resources come up correctly after restarting
>> the whole cluster, but the LVM and Filesystem resources fail to start after a
>> single node restart or standby/unstandby (after the node comes back online - why
>> do they even stop/start after the second node comes back?).
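
As for why they stop at all when the node comes back, and what exactly they are then waiting for, the quickest way I know of is to replay the policy-engine input that produced those "blocked" LogActions lines. Roughly like this (paths are the usual RHEL 6 / pacemaker 1.1 defaults, and the pe-input number below is only a placeholder - use whatever the pengine logged on the DC for that transition):

  # on the DC (pgdbsrv02), find the transition that contained the blocked starts
  grep -e "LogActions: Start.*blocked" -e "pe-input" /var/log/messages | tail

  # replay that transition, showing allocation scores and the actions pacemaker wanted to run
  crm_simulate -S -s -x /var/lib/pacemaker/pengine/pe-input-123.bz2

The -s (--show-scores) option should also give you the per-node scores that -V on its own doesn't print.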
>>
>> OS: CentOS 6.4 (cman stack)
>> Pacemaker: pacemaker-1.1.8-7.el6.x86_64
>> DRBD: drbd84-utils-8.4.3-1.el6.elrepo.x86_64
>>
>> Everything is configured using: pcs-0.9.26-10.el6_4.1.noarch
>>
>> Two DRBD resources configured and working: data01 & data02
>> Two nodes: pgdbsrv01.cl1.local & pgdbsrv02.cl1.local
>>
>> Configuration:
>>
>> node pgdbsrv01.cl1.local
>> node pgdbsrv02.cl1.local
>> primitive DRBD_data01 ocf:linbit:drbd \
>>         params drbd_resource="data01" \
>>         op monitor interval="30s"
>> primitive DRBD_data02 ocf:linbit:drbd \
>>         params drbd_resource="data02" \
>>         op monitor interval="30s"
>> primitive FS_data01 ocf:heartbeat:Filesystem \
>>         params device="/dev/mapper/vgdata01-lvdata01" directory="/data01" fstype="ext4" \
>>         op monitor interval="30s"
>> primitive FS_data02 ocf:heartbeat:Filesystem \
>>         params device="/dev/mapper/vgdata02-lvdata02" directory="/data02" fstype="ext4" \
>>         op monitor interval="30s"
>> primitive LVM_vgdata01 ocf:heartbeat:LVM \
>>         params volgrpname="vgdata01" exclusive="true" \
>>         op monitor interval="30s"
>> primitive LVM_vgdata02 ocf:heartbeat:LVM \
>>         params volgrpname="vgdata02" exclusive="true" \
>>         op monitor interval="30s"
>> group GRP_data01 LVM_vgdata01 FS_data01
>> group GRP_data02 LVM_vgdata02 FS_data02
>> ms DRBD_ms_data01 DRBD_data01 \
>>         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
>> ms DRBD_ms_data02 DRBD_data02 \
>>         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
>> colocation colocation-GRP_data01-DRBD_ms_data01-INFINITY inf: GRP_data01 DRBD_ms_data01:Master
>> colocation colocation-GRP_data02-DRBD_ms_data02-INFINITY inf: GRP_data02 DRBD_ms_data02:Master
>> order order-DRBD_data01-GRP_data01-mandatory : DRBD_data01:promote GRP_data01:start
>> order order-DRBD_data02-GRP_data02-mandatory : DRBD_data02:promote GRP_data02:start
>> property $id="cib-bootstrap-options" \
>>         dc-version="1.1.8-7.el6-394e906" \
>>         cluster-infrastructure="cman" \
>>         stonith-enabled="false" \
>>         no-quorum-policy="ignore" \
>>         migration-threshold="1"
>> rsc_defaults $id="rsc_defaults-options" \
>>         resource-stickiness="100"
>>
>>
>> 1) After starting the cluster, everything runs happily:
>>
>> Last updated: Tue Sep 3 00:11:13 2013
>> Last change: Tue Sep 3 00:05:15 2013 via cibadmin on pgdbsrv01.cl1.local
>> Stack: cman
>> Current DC: pgdbsrv02.cl1.local - partition with quorum
>> Version: 1.1.8-7.el6-394e906
>> 2 Nodes configured, unknown expected votes
>> 9 Resources configured.
>>
>> Online: [ pgdbsrv01.cl1.local pgdbsrv02.cl1.local ]
>>
>> Full list of resources:
>>
>>  Master/Slave Set: DRBD_ms_data01 [DRBD_data01]
>>      Masters: [ pgdbsrv01.cl1.local ]
>>      Slaves: [ pgdbsrv02.cl1.local ]
>>  Master/Slave Set: DRBD_ms_data02 [DRBD_data02]
>>      Masters: [ pgdbsrv01.cl1.local ]
>>      Slaves: [ pgdbsrv02.cl1.local ]
>>  Resource Group: GRP_data01
>>      LVM_vgdata01 (ocf::heartbeat:LVM): Started pgdbsrv01.cl1.local
>>      FS_data01 (ocf::heartbeat:Filesystem): Started pgdbsrv01.cl1.local
>>  Resource Group: GRP_data02
>>      LVM_vgdata02 (ocf::heartbeat:LVM): Started pgdbsrv01.cl1.local
>>      FS_data02 (ocf::heartbeat:Filesystem): Started pgdbsrv01.cl1.local
>>
>> 2) Putting node #1 into standby mode - after which everything runs happily on
>> node pgdbsrv02.cl1.local
>>
>> # pcs cluster standby pgdbsrv01.cl1.local
>> # pcs status
>> Last updated: Tue Sep 3 00:16:01 2013
>> Last change: Tue Sep 3 00:15:55 2013 via crm_attribute on pgdbsrv02.cl1.local
>> Stack: cman
>> Current DC: pgdbsrv02.cl1.local - partition with quorum
>> Version: 1.1.8-7.el6-394e906
>> 2 Nodes configured, unknown expected votes
>> 9 Resources configured.
>>
>>
>> Node pgdbsrv01.cl1.local: standby
>> Online: [ pgdbsrv02.cl1.local ]
>>
>> Full list of resources:
>>
>>  IP_database (ocf::heartbeat:IPaddr2): Started pgdbsrv02.cl1.local
>>  Master/Slave Set: DRBD_ms_data01 [DRBD_data01]
>>      Masters: [ pgdbsrv02.cl1.local ]
>>      Stopped: [ DRBD_data01:1 ]
>>  Master/Slave Set: DRBD_ms_data02 [DRBD_data02]
>>      Masters: [ pgdbsrv02.cl1.local ]
>>      Stopped: [ DRBD_data02:1 ]
>>  Resource Group: GRP_data01
>>      LVM_vgdata01 (ocf::heartbeat:LVM): Started pgdbsrv02.cl1.local
>>      FS_data01 (ocf::heartbeat:Filesystem): Started pgdbsrv02.cl1.local
>>  Resource Group: GRP_data02
>>      LVM_vgdata02 (ocf::heartbeat:LVM): Started pgdbsrv02.cl1.local
>>      FS_data02 (ocf::heartbeat:Filesystem): Started pgdbsrv02.cl1.local
>>
>> 3) Putting node #1 back online - it seems that all the resources stop (?),
>> and then DRBD gets promoted successfully on node #2, but the LVM and FS resources
>> never start
>>
>> # pcs cluster unstandby pgdbsrv01.cl1.local
>> # pcs status
>> Last updated: Tue Sep 3 00:17:00 2013
>> Last change: Tue Sep 3 00:16:56 2013 via crm_attribute on pgdbsrv02.cl1.local
>> Stack: cman
>> Current DC: pgdbsrv02.cl1.local - partition with quorum
>> Version: 1.1.8-7.el6-394e906
>> 2 Nodes configured, unknown expected votes
>> 9 Resources configured.
>>
>>
>> Online: [ pgdbsrv01.cl1.local pgdbsrv02.cl1.local ]
>>
>> Full list of resources:
>>
>>  IP_database (ocf::heartbeat:IPaddr2): Started pgdbsrv02.cl1.local
>>  Master/Slave Set: DRBD_ms_data01 [DRBD_data01]
>>      Masters: [ pgdbsrv02.cl1.local ]
>>      Slaves: [ pgdbsrv01.cl1.local ]
>>  Master/Slave Set: DRBD_ms_data02 [DRBD_data02]
>>      Masters: [ pgdbsrv02.cl1.local ]
>>      Slaves: [ pgdbsrv01.cl1.local ]
>>  Resource Group: GRP_data01
>>      LVM_vgdata01 (ocf::heartbeat:LVM): Stopped
>>      FS_data01 (ocf::heartbeat:Filesystem): Stopped
>>  Resource Group: GRP_data02
>>      LVM_vgdata02 (ocf::heartbeat:LVM): Stopped
>>      FS_data02 (ocf::heartbeat:Filesystem): Stopped
>>
>>
>>
>> Any ideas why this is happening / what could be wrong in the resource
>> configuration? The same thing happens when testing with the
>> resources initially placed the other way around. Also, if I stop & start one
>> of the nodes, the same thing happens once the node comes back online.
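
One thing I'd look at in the config itself: the colocation constraints reference the ms resources (DRBD_ms_data01/DRBD_ms_data02), but the order constraints reference the DRBD primitives inside them (DRBD_data01:promote / DRBD_data02:promote). I'm not sure off-hand how 1.1.8 maps an order on the child of a master/slave resource, but the documented pattern (e.g. in Clusters from Scratch) is to reference the ms resource in both. Something along these lines, with the constraint names just picked for illustration:

  order order-DRBD_ms_data01-GRP_data01-mandatory inf: DRBD_ms_data01:promote GRP_data01:start
  order order-DRBD_ms_data02-GRP_data02-mandatory inf: DRBD_ms_data02:promote GRP_data02:start

That at least removes one ambiguity before digging further into the pe-input files.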
>>
>>
>> --
>> Heikki Manninen <h...@iki.fi>
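
Unrelated to the start problem, but worth mentioning: migration-threshold is a resource meta attribute, not a cluster property, so under cib-bootstrap-options it is most likely just being ignored. If you want it cluster-wide it belongs with the resource defaults, e.g. (same crm syntax as your config, untested):

  rsc_defaults $id="rsc_defaults-options" \
          resource-stickiness="100" \
          migration-threshold="1"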
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org