On 06/09/2013, at 5:51 PM, Lars Ellenberg <lars.ellenb...@linbit.com> wrote:
> On Tue, Aug 27, 2013 at 06:51:45AM +0200, Andreas Mock wrote:
>> Hi Andrew,
>>
>> as this is a real showstopper at the moment, I invested some more
>> hours to be sure (as far as possible) of not having made an error.
>>
>> Some additions:
>> 1) I mirrored the whole mini drbd config to another pacemaker cluster.
>>    Same result: pacemaker 1.1.8 works, pacemaker 1.1.10 does not.
>> 2) When I remove the target role Stopped from the drbd ms resource
>>    and insert the config snippet related to the drbd device via crm -f <file>
>>    into a lean running pacemaker config (pacemaker cluster options, stonith
>>    resources), it seems to work. That means one of the nodes gets promoted.
>>
>> Then, after stopping with 'crm resource stop ms_drbd_xxx' and starting again,
>> I see the same promotion error as described.
>>
>> The drbd resource agent is using /usr/sbin/crm_master.
>> Is there a possibility that feedback given through this client tool
>> is changing the timing behaviour of pacemaker? Or the way
>> transitions are scheduled?
>> Any idea that may be related to a change in pacemaker?
>
> I think that recent pacemaker allows for "start" and "promote" in the
> same transition.

At least in the one case I saw logs of, this wasn't the case. The PE computed:

Current cluster status:
Online: [ db05 db06 ]

 r_stonith-db05  (stonith:fence_imm):  Started db06
 r_stonith-db06  (stonith:fence_imm):  Started db05
 Master/Slave Set: ms_drbd_fodb [r_drbd_fodb]
     Slaves: [ db05 db06 ]
 Master/Slave Set: ms_drbd_fodblog [r_drbd_fodblog]
     Slaves: [ db05 db06 ]

Transition Summary:
 * Promote r_drbd_fodb:0     (Slave -> Master db05)
 * Promote r_drbd_fodblog:0  (Slave -> Master db05)

and it was the promotion of r_drbd_fodb:0 that failed.

> Or at least has promote follow start much faster than before.
>
> To be able to really tell why DRBD has problems to promote,
> I'd need the drbd (kernel!) logs; the agent logs are not good enough.
>
> Maybe you hit some interesting race between DRBD establishing the
> connection and pacemaker trying to promote one of the nodes.
> Correlating DRBD kernel logs with pacemaker logs should tell.
>
> Do you have DRBD resource-level fencing enabled?
>
> I guess right now you can reproduce it easily by:
>   crm resource stop ms_drbd
>   crm resource start ms_drbd
>
> I suspect you would not be able to reproduce it by:
>   crm resource stop ms_drbd
>   crm resource demote ms_drbd    (will only make drbd Secondary)
>   ... meanwhile, DRBD will establish the connection ...
>   crm resource promote ms_drbd   (will then promote one node)
>
> Hth,
>
> Lars
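(For reference, resource-level fencing as Lars asks about is typically wired up in
drbd.conf roughly as follows. This is only a sketch following the DRBD 8.4 user's
guide; the handler paths assume a packaged install, so a source build under
/usr/local, as used in this thread, will likely have them elsewhere.)

  resource postfix {
    disk {
      # refuse to promote while the peer's data might be newer
      fencing resource-only;
    }
    handlers {
      # add a location constraint to the CIB while the peer is unreachable,
      # and remove it again once resync has finished
      fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
      after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
  }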
>> -----Original Message-----
>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>> Sent: Tuesday, August 27, 2013 05:02
>> To: General Linux-HA mailing list
>> Cc: pacemaker@oss.clusterlabs.org
>> Subject: Re: [Pacemaker] [Linux-HA] Probably a regression of the linbit drbd
>> agent between pacemaker 1.1.8 and 1.1.10
>>
>> On 27/08/2013, at 3:31 AM, Andreas Mock <andreas.m...@web.de> wrote:
>>
>>> Hi all,
>>>
>>> while the linbit drbd resource agent seems to work perfectly on
>>> pacemaker 1.1.8 (standard software repository), we have problems with
>>> the latest release 1.1.10 and also with the newest head 1.1.11.xxx.
>>>
>>> As using drbd is not so uncommon, I really hope to find interested
>>> people to help me out. I can provide as much debug information as you
>>> want.
>>>
>>> Environment:
>>> - RHEL 6.4 clone (Scientific Linux 6.4), cman-based cluster, 64 bit
>>> - DRBD 8.4.3 compiled from sources
>>> - A drbd resource configured following the linbit documentation
>>> - Manual start and stop (up/down) and setting primary of the drbd resource
>>>   work smoothly
>>> - 2 nodes dis03-test/dis04-test
>>>
>>> - The following simple config on pacemaker 1.1.8:
>>>
>>> configure
>>> property no-quorum-policy=stop
>>> property stonith-enabled=true
>>> rsc_defaults resource-stickiness=2
>>> primitive r_stonith-dis03-test stonith:fence_mock \
>>>     meta resource-stickiness="INFINITY" target-role="Started" \
>>>     op monitor interval="180" timeout="300" requires="nothing" \
>>>     op start interval="0" timeout="300" \
>>>     op stop interval="0" timeout="300" \
>>>     params vmname=dis03-test pcmk_host_list="dis03-test"
>>> primitive r_stonith-dis04-test stonith:fence_mock \
>>>     meta resource-stickiness="INFINITY" target-role="Started" \
>>>     op monitor interval="180" timeout="300" requires="nothing" \
>>>     op start interval="0" timeout="300" \
>>>     op stop interval="0" timeout="300" \
>>>     params vmname=dis04-test pcmk_host_list="dis04-test"
>>> location r_stonith-dis03_hates_dis03 r_stonith-dis03-test \
>>>     rule $id="r_stonith-dis03_hates_dis03-test_rule" -inf: #uname eq dis03-test
>>> location r_stonith-dis04_hates_dis04 r_stonith-dis04-test \
>>>     rule $id="r_stonith-dis04_hates_dis04-test_rule" -inf: #uname eq dis04-test
>>> primitive r_drbd_postfix ocf:linbit:drbd \
>>>     params drbd_resource="postfix" drbdconf="/usr/local/etc/drbd.conf" \
>>>     op monitor interval="15s" timeout="60s" role="Master" \
>>>     op monitor interval="45s" timeout="60s" role="Slave" \
>>>     op start timeout="240" \
>>>     op stop timeout="240" \
>>>     meta target-role="Stopped" migration-threshold="2"
>>> ms ms_drbd_postfix r_drbd_postfix \
>>>     meta master-max="1" master-node-max="1" \
>>>          clone-max="2" clone-node-max="1" \
>>>          notify="true" \
>>>     meta target-role="Stopped"
>>> commit
>>>
>>> - Pacemaker is started from scratch.
>>> - The config above is applied with crm -f <file>, where <file> contains the
>>>   above config snippet.
>>>
>>> - After that, crm_mon shows the following status:
>>> ----------------------8<-------------------------
>>> Last updated: Mon Aug 26 18:42:47 2013
>>> Last change: Mon Aug 26 18:42:42 2013 via cibadmin on dis03-test
>>> Stack: cman
>>> Current DC: dis03-test - partition with quorum
>>> Version: 1.1.10-1.el6-9abe687
>>> 2 Nodes configured
>>> 4 Resources configured
>>>
>>> Online: [ dis03-test dis04-test ]
>>>
>>> Full list of resources:
>>>
>>> r_stonith-dis03-test (stonith:fence_mock): Started dis04-test
>>> r_stonith-dis04-test (stonith:fence_mock): Started dis03-test
>>> Master/Slave Set: ms_drbd_postfix [r_drbd_postfix]
>>>     Stopped: [ dis03-test dis04-test ]
>>>
>>> Migration summary:
>>> * Node dis04-test:
>>> * Node dis03-test:
>>> ----------------------8<-------------------------
>>>
>>> cat /proc/drbd
>>> version: 8.4.3 (api:1/proto:86-101)
>>> GIT-hash: 89a294209144b68adb3ee85a73221f964d3ee515 build by root@dis03-test,
>>> 2013-07-24 17:19:24
>>>
>>> on both nodes. The drbd resource was shut down previously in a clean
>>> state, so that any node can be the primary.
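(As an aside, the "clean state" assumption above is easy to check from the shell
before handing the resource to pacemaker; a minimal sketch, assuming the resource
is called postfix and drbd.conf lives in /usr/local/etc as configured above:)

  drbdadm -c /usr/local/etc/drbd.conf role   postfix   # expect Secondary/Secondary
  drbdadm -c /usr/local/etc/drbd.conf cstate postfix   # expect Connected
  drbdadm -c /usr/local/etc/drbd.conf dstate postfix   # expect UpToDate/UpToDate
  cat /proc/drbd                                       # same information, all resources

(Right after "drbdadm up", before the peers have connected, the local disk of a
cleanly shut down resource is typically only "Consistent" rather than "UpToDate";
promoting at that moment fails in exactly the way shown further down.)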
>>>
>>> - Now the weird behaviour when trying to start the drbd with
>>>   crm resource start ms_drbd_postfix
>>>
>>> Output of crm_mon -1rf:
>>> ----------------------8<-------------------------
>>> Last updated: Mon Aug 26 18:46:33 2013
>>> Last change: Mon Aug 26 18:46:30 2013 via cibadmin on dis04-test
>>> Stack: cman
>>> Current DC: dis03-test - partition with quorum
>>> Version: 1.1.10-1.el6-9abe687
>>> 2 Nodes configured
>>> 4 Resources configured
>>>
>>> Online: [ dis03-test dis04-test ]
>>>
>>> Full list of resources:
>>>
>>> r_stonith-dis03-test (stonith:fence_mock): Started dis04-test
>>> r_stonith-dis04-test (stonith:fence_mock): Started dis03-test
>>> Master/Slave Set: ms_drbd_postfix [r_drbd_postfix]
>>>     Slaves: [ dis03-test ]
>>>     Stopped: [ dis04-test ]
>>>
>>> Migration summary:
>>> * Node dis04-test:
>>>    r_drbd_postfix: migration-threshold=2 fail-count=2
>>>                    last-failure='Mon Aug 26 18:46:30 2013'
>>> * Node dis03-test:
>>
>> It's hard to imagine how pacemaker could cause drbdadm to fail, short of
>> leaving the other side promoted while trying to promote another.
>> Perhaps the drbd folks could comment on what the error means.
>>
>>> Failed actions:
>>>     r_drbd_postfix_promote_0 (node=dis04-test, call=34, rc=1,
>>>     status=complete, last-rc-change=Mon Aug 26 18:46:29 2013,
>>>     queued=1212ms, exec=0ms): unknown error
>>> ----------------------8<-------------------------
>>>
>>> In the log of the drbd agent I can find the following when the
>>> promote request is handled on dis03-test:
>>>
>>> ----------------------8<-------------------------
>>> ++ drbdadm -c /usr/local/etc/drbd.conf primary postfix
>>> 0: State change failed: (-2) Need access to UpToDate data
>>> Command 'drbdsetup primary 0' terminated with exit code 17
>>> + cmd_out=
>>> + ret=17
>>> + '[' 17 '!=' 0 ']'
>>> + ocf_log err 'postfix: Called drbdadm -c /usr/local/etc/drbd.conf primary postfix'
>>> + '[' 2 -lt 2 ']'
>>> + __OCF_PRIO=err
>>> + shift
>>> ----------------------8<-------------------------
>>>
>>> While this works without problems on pacemaker 1.1.8, it doesn't work here.
>>> The error message leads me to assume that there is a kind of race condition
>>> where pacemaker is firing the promotion too early.
>>> Probably it has something to do with applying attributes from the drbd
>>> resource agent.
>>> But this is just a guess and I really don't know.
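(Regarding the guess about attributes: the /usr/sbin/crm_master calls mentioned
earlier in the thread boil down to setting a per-node "master score" attribute
that the policy engine consults when deciding whom to promote. A rough sketch of
the mechanism, not the linbit agent's actual code; the attribute name below
follows the usual master-<resource> convention, may carry an instance suffix
such as :0, and the score value is made up:)

  # inside a resource agent (OCF_RESOURCE_INSTANCE identifies the clone instance):
  crm_master -l reboot -v 10000    # "my data is good, prefer me as Master"
  crm_master -l reboot -D          # "do not promote me here"

  # the result can be inspected as a transient node attribute, e.g.:
  crm_mon -1 -A
  crm_attribute -N dis04-test -n master-r_drbd_postfix -l reboot -G

(Whether and when such a score was set should be visible in the CIB and attrd
logs around the failed promote.)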
>>> ONE ADDITIONAL piece of information: as soon as I do a resource cleanup on the
>>> "defective" node, the master is promoted as expected. That means a
>>>   crm resource cleanup r_drbd_postfix dis03-test
>>> results in the following:
>>>
>>> ----------------------8<-------------------------
>>> Last updated: Mon Aug 26 19:29:38 2013
>>> Last change: Mon Aug 26 19:29:28 2013 via cibadmin on dis04-test
>>> Stack: cman
>>> Current DC: dis03-test - partition with quorum
>>> Version: 1.1.10-1.el6-9abe687
>>> 2 Nodes configured
>>> 4 Resources configured
>>>
>>> Online: [ dis03-test dis04-test ]
>>>
>>> Full list of resources:
>>>
>>> r_stonith-dis03-test (stonith:fence_mock): Started dis04-test
>>> r_stonith-dis04-test (stonith:fence_mock): Started dis03-test
>>> Master/Slave Set: ms_drbd_postfix [r_drbd_postfix]
>>>     Masters: [ dis03-test ]
>>>     Slaves: [ dis04-test ]
>>>
>>> Migration summary:
>>> * Node dis03-test:
>>> * Node dis04-test:
>>> ----------------------8<-------------------------
>>>
>>> I really hope I can get some attention, as pacemaker 1.1.10 is a
>>> milestone for Andrew, and drbd from linbit is surely a building
>>> block of many pacemaker-based clusters.
>>>
>>> Cluster log of DC dis03-test at http://pastebin.com/2S9Y6V3P
>>> DRBD agent log at http://pastebin.com/ceYNEAhH
>>>
>>> So, any help is welcome.
>>>
>>> Best regards
>>> Andreas Mock
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
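(Finally, to follow Lars's suggestion above of correlating DRBD kernel messages
with what pacemaker was doing at promote time, something like the following is
usually enough. The log locations are assumptions for a RHEL 6 / cman setup and
may differ locally:)

  # DRBD state transitions ("conn( ... )", "disk( ... )", "role( ... )") are
  # logged by the kernel:
  grep -E 'drbd postfix|block drbd0' /var/log/messages

  # what crmd/pengine/lrmd did around the failed promote:
  grep -E 'r_drbd_postfix|promote' /var/log/cluster/corosync.log

  # compare timestamps: did the promote run before the connection reached
  # Connected / the disk reached UpToDate?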