On 09/09/2013, at 6:46 PM, Heikki Manninen <h...@iki.fi> wrote:

> Hello Andreas, thanks for your input, much appreciated.
>
> On 5.9.2013, at 16.39, "Andreas Mock" <andreas.m...@web.de> wrote:
>
>> 1) The second output of crm_mon shows a resource IP_database
>> which is not shown in the initial crm_mon output and also
>> not in the config. => Reduce your problem/config to the
>> minimum that reproduces the issue.
>
> True. I edited out that resource from the e-mail because it did not have anything to
> do with the problem as such (it works fine all the time). I just forgot to remove it
> from the second copy-paste as well. And yes, there is no IP resource in the
> configuration any more.
>
>> 2) Enable logging and find out which node is the DC.
>> In its logs you will find a lot of information showing
>> what is going on. Hint: open a terminal session with
>> "tail -f <logfile>" running and watch it while issuing commands.
>> You'll get used to it.
>
> Seems that node #2 was the DC (also visible in the pcs status output). I have
> looked at the logs all the time, I'm just not yet too familiar with the contents
> of pacemaker logging. Here's the thing that keeps repeating every time those
> LVM and FS resources stay in the stopped state:
>
> Sep 3 20:01:23 pgdbsrv02 pengine[1667]: notice: LogActions: Start LVM_vgdata01#011(pgdbsrv01.cl1.local - blocked)
> Sep 3 20:01:23 pgdbsrv02 pengine[1667]: notice: LogActions: Start FS_data01#011(pgdbsrv01.cl1.local - blocked)
> Sep 3 20:01:23 pgdbsrv02 pengine[1667]: notice: LogActions: Start LVM_vgdata02#011(pgdbsrv01.cl1.local - blocked)
> Sep 3 20:01:23 pgdbsrv02 pengine[1667]: notice: LogActions: Start FS_data02#011(pgdbsrv01.cl1.local - blocked)
>
> So what does "blocked" mean here?
It means we'd like to start those resources on pgdbsrv01.cl1.local but something has to happen first - possibly some other resource needs to start but can't.

> Is it that node #1 in this case is in need of fencing/stonithing and is thus
> being blocked, or something else? (I have a background in the
> RHCS/HACMP/LifeKeeper etc. world.) No-quorum-policy is set to ignore.
>
>> 3) The status of a drbd resource shown by crm_mon doesn't give
>> you all the information about the drbd devices. Have a look at
>> drbd-overview on both nodes (e.g. syncing status).
>
> True, DRBD is working fine on these occasions. Connected, synced etc.
>
>> 4) This setup CRIES for stonithing. Even in a test environment.
>> When stonith happens (this is what you see immediately) you
>> know something went wrong. This is a good indicator for
>> errors in agents or in the config. Believe me, as tedious as stonithing
>> is, it is just as valuable for getting hints about a bad cluster state.
>> On virtual machines stonithing is not as painful as on real
>> servers.
>
> Very much true. I have implemented some custom fencing/stonithing agents
> before on physical and virtual cluster environments. The problem here is
> that I'm not aware of a reasonably simple way to implement stonith with VMware
> Fusion, which I'm bound to use for this test setup. I'll have to dig more into this,
> though. So fencing from the cman cluster.conf is chained to pacemaker fencing,
> pacemaker stonith is disabled, and no-quorum-policy is ignore.
>
>> 5) Is the drbd fencing script enabled? If yes, in certain circumstances
>> -INF rules are inserted to deny promotion on the "wrong" nodes.
>> You should grep for them: 'cibadmin -Q | grep <resname>'
>
> No, DRBD fencing is not enabled and split-brain recovery is done manually.
>
>> 6) crm_simulate -L -v gives you an output of the scores of
>> the resources on each node. I really don't know how to read it
>> exactly (is there documentation for that anywhere?), but it
>> gives you a hint where to look when resources don't start.
>> Especially the aggregation of stickiness values in groups is
>> sometimes misleading.
>
> Could be that I have a different version, because -v is an unknown
> option and:
>
> # crm_simulate -L -V
>
> Current cluster status:
> Online: [ pgdbsrv01.cl1.local pgdbsrv02.cl1.local ]
>
>  Master/Slave Set: DRBD_ms_data01 [DRBD_data01]
>      Masters: [ pgdbsrv01.cl1.local ]
>      Slaves: [ pgdbsrv02.cl1.local ]
>  Master/Slave Set: DRBD_ms_data02 [DRBD_data02]
>      Masters: [ pgdbsrv01.cl1.local ]
>      Slaves: [ pgdbsrv02.cl1.local ]
>  Resource Group: GRP_data01
>      LVM_vgdata01 (ocf::heartbeat:LVM): Stopped
>      FS_data01 (ocf::heartbeat:Filesystem): Stopped
>  Resource Group: GRP_data02
>      LVM_vgdata02 (ocf::heartbeat:LVM): Stopped
>      FS_data02 (ocf::heartbeat:Filesystem): Stopped
>
> only shows that much.
>
> Original problem description left quoted below.
>
>
> --
> Heikki M
>
>
>> -----Original Message-----
>> From: Heikki Manninen [mailto:h...@iki.fi]
>> Sent: Thursday, 5 September 2013 14:08
>> To: pacemaker@oss.clusterlabs.org
>> Subject: [Pacemaker] Resource ordering/colocating question (DRBD + LVM + FS)
>>
>> Hello,
>>
>> I'm having a bit of a problem understanding what's going on with my simple
>> two-node demo cluster here. My resources come up correctly after restarting
>> the whole cluster, but the LVM and Filesystem resources fail to start after a
>> single node restart or standby/unstandby (after the node comes back online - why
>> do they even stop/start after the second node comes back?).
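
As for why they stop at all when the node comes back, and what exactly they are then waiting for, the quickest way I know of is to replay the policy-engine input that produced those "blocked" LogActions lines. Roughly like this (paths are the usual RHEL 6 / pacemaker 1.1 defaults, and the pe-input number below is only a placeholder - use whatever the pengine logged on the DC for that transition):

  # on the DC (pgdbsrv02), find the transition that contained the blocked starts
  grep -e "LogActions: Start.*blocked" -e "pe-input" /var/log/messages | tail

  # replay that transition, showing allocation scores and the actions pacemaker wanted to run
  crm_simulate -S -s -x /var/lib/pacemaker/pengine/pe-input-123.bz2

The -s (--show-scores) option should also give you the per-node scores that -V on its own doesn't print.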
>>
>> OS: CentOS 6.4 (cman stack)
>> Pacemaker: pacemaker-1.1.8-7.el6.x86_64
>> DRBD: drbd84-utils-8.4.3-1.el6.elrepo.x86_64
>>
>> Everything is configured using: pcs-0.9.26-10.el6_4.1.noarch
>>
>> Two DRBD resources configured and working: data01 & data02
>> Two nodes: pgdbsrv01.cl1.local & pgdbsrv02.cl1.local
>>
>> Configuration:
>>
>> node pgdbsrv01.cl1.local
>> node pgdbsrv02.cl1.local
>> primitive DRBD_data01 ocf:linbit:drbd \
>>         params drbd_resource="data01" \
>>         op monitor interval="30s"
>> primitive DRBD_data02 ocf:linbit:drbd \
>>         params drbd_resource="data02" \
>>         op monitor interval="30s"
>> primitive FS_data01 ocf:heartbeat:Filesystem \
>>         params device="/dev/mapper/vgdata01-lvdata01" directory="/data01" fstype="ext4" \
>>         op monitor interval="30s"
>> primitive FS_data02 ocf:heartbeat:Filesystem \
>>         params device="/dev/mapper/vgdata02-lvdata02" directory="/data02" fstype="ext4" \
>>         op monitor interval="30s"
>> primitive LVM_vgdata01 ocf:heartbeat:LVM \
>>         params volgrpname="vgdata01" exclusive="true" \
>>         op monitor interval="30s"
>> primitive LVM_vgdata02 ocf:heartbeat:LVM \
>>         params volgrpname="vgdata02" exclusive="true" \
>>         op monitor interval="30s"
>> group GRP_data01 LVM_vgdata01 FS_data01
>> group GRP_data02 LVM_vgdata02 FS_data02
>> ms DRBD_ms_data01 DRBD_data01 \
>>         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
>> ms DRBD_ms_data02 DRBD_data02 \
>>         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
>> colocation colocation-GRP_data01-DRBD_ms_data01-INFINITY inf: GRP_data01 DRBD_ms_data01:Master
>> colocation colocation-GRP_data02-DRBD_ms_data02-INFINITY inf: GRP_data02 DRBD_ms_data02:Master
>> order order-DRBD_data01-GRP_data01-mandatory : DRBD_data01:promote GRP_data01:start
>> order order-DRBD_data02-GRP_data02-mandatory : DRBD_data02:promote GRP_data02:start
>> property $id="cib-bootstrap-options" \
>>         dc-version="1.1.8-7.el6-394e906" \
>>         cluster-infrastructure="cman" \
>>         stonith-enabled="false" \
>>         no-quorum-policy="ignore" \
>>         migration-threshold="1"
>> rsc_defaults $id="rsc_defaults-options" \
>>         resource-stickiness="100"
>>
>>
>> 1) After starting the cluster, everything runs happily:
>>
>> Last updated: Tue Sep 3 00:11:13 2013
>> Last change: Tue Sep 3 00:05:15 2013 via cibadmin on pgdbsrv01.cl1.local
>> Stack: cman
>> Current DC: pgdbsrv02.cl1.local - partition with quorum
>> Version: 1.1.8-7.el6-394e906
>> 2 Nodes configured, unknown expected votes
>> 9 Resources configured.
>>
>> Online: [ pgdbsrv01.cl1.local pgdbsrv02.cl1.local ]
>>
>> Full list of resources:
>>
>>  Master/Slave Set: DRBD_ms_data01 [DRBD_data01]
>>      Masters: [ pgdbsrv01.cl1.local ]
>>      Slaves: [ pgdbsrv02.cl1.local ]
>>  Master/Slave Set: DRBD_ms_data02 [DRBD_data02]
>>      Masters: [ pgdbsrv01.cl1.local ]
>>      Slaves: [ pgdbsrv02.cl1.local ]
>>  Resource Group: GRP_data01
>>      LVM_vgdata01 (ocf::heartbeat:LVM): Started pgdbsrv01.cl1.local
>>      FS_data01 (ocf::heartbeat:Filesystem): Started pgdbsrv01.cl1.local
>>  Resource Group: GRP_data02
>>      LVM_vgdata02 (ocf::heartbeat:LVM): Started pgdbsrv01.cl1.local
>>      FS_data02 (ocf::heartbeat:Filesystem): Started pgdbsrv01.cl1.local
>>
>> 2) Putting node #1 into standby mode - after which everything runs happily on
>> node pgdbsrv02.cl1.local
>>
>> # pcs cluster standby pgdbsrv01.cl1.local
>> # pcs status
>> Last updated: Tue Sep 3 00:16:01 2013
>> Last change: Tue Sep 3 00:15:55 2013 via crm_attribute on pgdbsrv02.cl1.local
>> Stack: cman
>> Current DC: pgdbsrv02.cl1.local - partition with quorum
>> Version: 1.1.8-7.el6-394e906
>> 2 Nodes configured, unknown expected votes
>> 9 Resources configured.
>>
>>
>> Node pgdbsrv01.cl1.local: standby
>> Online: [ pgdbsrv02.cl1.local ]
>>
>> Full list of resources:
>>
>>  IP_database (ocf::heartbeat:IPaddr2): Started pgdbsrv02.cl1.local
>>  Master/Slave Set: DRBD_ms_data01 [DRBD_data01]
>>      Masters: [ pgdbsrv02.cl1.local ]
>>      Stopped: [ DRBD_data01:1 ]
>>  Master/Slave Set: DRBD_ms_data02 [DRBD_data02]
>>      Masters: [ pgdbsrv02.cl1.local ]
>>      Stopped: [ DRBD_data02:1 ]
>>  Resource Group: GRP_data01
>>      LVM_vgdata01 (ocf::heartbeat:LVM): Started pgdbsrv02.cl1.local
>>      FS_data01 (ocf::heartbeat:Filesystem): Started pgdbsrv02.cl1.local
>>  Resource Group: GRP_data02
>>      LVM_vgdata02 (ocf::heartbeat:LVM): Started pgdbsrv02.cl1.local
>>      FS_data02 (ocf::heartbeat:Filesystem): Started pgdbsrv02.cl1.local
>>
>> 3) Putting node #1 back online - it seems that all the resources stop (?),
>> and then DRBD gets promoted successfully on node #2, but the LVM and FS resources
>> never start
>>
>> # pcs cluster unstandby pgdbsrv01.cl1.local
>> # pcs status
>> Last updated: Tue Sep 3 00:17:00 2013
>> Last change: Tue Sep 3 00:16:56 2013 via crm_attribute on pgdbsrv02.cl1.local
>> Stack: cman
>> Current DC: pgdbsrv02.cl1.local - partition with quorum
>> Version: 1.1.8-7.el6-394e906
>> 2 Nodes configured, unknown expected votes
>> 9 Resources configured.
>>
>>
>> Online: [ pgdbsrv01.cl1.local pgdbsrv02.cl1.local ]
>>
>> Full list of resources:
>>
>>  IP_database (ocf::heartbeat:IPaddr2): Started pgdbsrv02.cl1.local
>>  Master/Slave Set: DRBD_ms_data01 [DRBD_data01]
>>      Masters: [ pgdbsrv02.cl1.local ]
>>      Slaves: [ pgdbsrv01.cl1.local ]
>>  Master/Slave Set: DRBD_ms_data02 [DRBD_data02]
>>      Masters: [ pgdbsrv02.cl1.local ]
>>      Slaves: [ pgdbsrv01.cl1.local ]
>>  Resource Group: GRP_data01
>>      LVM_vgdata01 (ocf::heartbeat:LVM): Stopped
>>      FS_data01 (ocf::heartbeat:Filesystem): Stopped
>>  Resource Group: GRP_data02
>>      LVM_vgdata02 (ocf::heartbeat:LVM): Stopped
>>      FS_data02 (ocf::heartbeat:Filesystem): Stopped
>>
>>
>>
>> Any ideas why this is happening / what could be wrong in the resource
>> configuration? The same thing happens when testing with the
>> resources initially placed the other way around. Also, if I stop & start one
>> of the nodes, the same thing happens once the node comes back online.
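
One thing I'd look at in the config itself: the colocation constraints reference the ms resources (DRBD_ms_data01/DRBD_ms_data02), but the order constraints reference the DRBD primitives inside them (DRBD_data01:promote / DRBD_data02:promote). I'm not sure off-hand how 1.1.8 maps an order on the child of a master/slave resource, but the documented pattern (e.g. in Clusters from Scratch) is to reference the ms resource in both. Something along these lines, with the constraint names just picked for illustration:

  order order-DRBD_ms_data01-GRP_data01-mandatory inf: DRBD_ms_data01:promote GRP_data01:start
  order order-DRBD_ms_data02-GRP_data02-mandatory inf: DRBD_ms_data02:promote GRP_data02:start

That at least removes one ambiguity before digging further into the pe-input files.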
>>
>>
>> --
>> Heikki Manninen <h...@iki.fi>
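
Unrelated to the start problem, but worth mentioning: migration-threshold is a resource meta attribute, not a cluster property, so under cib-bootstrap-options it is most likely just being ignored. If you want it cluster-wide it belongs with the resource defaults, e.g. (same crm syntax as your config, untested):

  rsc_defaults $id="rsc_defaults-options" \
          resource-stickiness="100" \
          migration-threshold="1"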
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org