On 27/08/2013, at 2:51 PM, Andreas Mock <andreas.m...@web.de> wrote:
> Hi Andrew,
> 
> as this is a real showstopper at the moment, I invested several more
> hours to be sure (as far as possible) that I haven't made an error.
> 
> Some additions:
> 1) I mirrored the whole minimal drbd config to another pacemaker cluster.
> Same result: pacemaker 1.1.8 works, pacemaker 1.1.10 does not.

The version of drbd is the same too?

> 2) When I remove the target-role Stopped from the drbd ms resource
> and insert the config snippet related to the drbd device via crm -f <file>
> into a lean running pacemaker config (pacemaker cluster options, stonith
> resources), it seems to work. That means one of the nodes gets promoted.
> 
> Then, after stopping with 'crm resource stop ms_drbd_xxx' and starting
> again, I see the same promotion error as described.
> 
> The drbd resource agent uses /usr/sbin/crm_master.
> Is there a possibility that feedback given through this client tool
> changes the timing behaviour of pacemaker? Or the way
> transitions are scheduled?
> Any idea what may be related to a change in pacemaker?

# git diff --stat Pacemaker-1.1.8..Pacemaker-1.1.10 | tail -n 1
 1610 files changed, 109697 insertions(+), 62940 deletions(-)

Needle, meet haystack. Particularly since I have no idea what that drbd
error means.

If you want me to have a look, you'll need to create a crm_report archive
of "works" and "not works". Logs aren't enough.
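
(For reference, a minimal crm_report invocation covering the failure window
might look like the sketch below; the timestamps are taken from the crm_mon
output later in this thread, and the archive name is arbitrary.)

    # Collect logs, CIB history and pe-inputs for the window around the
    # failed promote; run once for the "works" and once for the "not works"
    # case, then attach the resulting tarballs:
    crm_report -f "2013-08-26 18:40:00" -t "2013-08-26 19:00:00" drbd-promote-fail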
> Best regards
> Andreas Mock
> 
> 
> -----Original Message-----
> From: Andrew Beekhof [mailto:and...@beekhof.net]
> Sent: Tuesday, 27 August 2013 05:02
> To: General Linux-HA mailing list
> Cc: pacemaker@oss.clusterlabs.org
> Subject: Re: [Pacemaker] [Linux-HA] Probably a regression of the linbit
> drbd agent between pacemaker 1.1.8 and 1.1.10
> 
> 
> On 27/08/2013, at 3:31 AM, Andreas Mock <andreas.m...@web.de> wrote:
> 
>> Hi all,
>> 
>> while the linbit drbd resource agent seems to work perfectly on
>> pacemaker 1.1.8 (standard software repository), we have problems with
>> the latest release 1.1.10 and also with the newest head 1.1.11.xxx.
>> 
>> As using drbd is not so uncommon, I really hope to find interested
>> people to help me out. I can provide as much debug information as
>> you want.
>> 
>> Environment:
>> RHEL 6.4 clone (Scientific Linux 6.4), cman-based cluster.
>> DRBD 8.4.3 compiled from sources.
>> 64bit
>> 
>> - A drbd resource configured following the linbit documentation.
>> - Manual start and stop (up/down) and setting primary of the drbd
>>   resource work smoothly.
>> - 2 nodes: dis03-test/dis04-test
>> 
>> - The following simple config on pacemaker 1.1.8:
>> 
>> configure
>> property no-quorum-policy=stop
>> property stonith-enabled=true
>> rsc_defaults resource-stickiness=2
>> primitive r_stonith-dis03-test stonith:fence_mock \
>>         meta resource-stickiness="INFINITY" target-role="Started" \
>>         op monitor interval="180" timeout="300" requires="nothing" \
>>         op start interval="0" timeout="300" \
>>         op stop interval="0" timeout="300" \
>>         params vmname=dis03-test pcmk_host_list="dis03-test"
>> primitive r_stonith-dis04-test stonith:fence_mock \
>>         meta resource-stickiness="INFINITY" target-role="Started" \
>>         op monitor interval="180" timeout="300" requires="nothing" \
>>         op start interval="0" timeout="300" \
>>         op stop interval="0" timeout="300" \
>>         params vmname=dis04-test pcmk_host_list="dis04-test"
>> location r_stonith-dis03_hates_dis03 r_stonith-dis03-test \
>>         rule $id="r_stonith-dis03_hates_dis03-test_rule" -inf: #uname eq dis03-test
>> location r_stonith-dis04_hates_dis04 r_stonith-dis04-test \
>>         rule $id="r_stonith-dis04_hates_dis04-test_rule" -inf: #uname eq dis04-test
>> primitive r_drbd_postfix ocf:linbit:drbd \
>>         params drbd_resource="postfix" drbdconf="/usr/local/etc/drbd.conf" \
>>         op monitor interval="15s" timeout="60s" role="Master" \
>>         op monitor interval="45s" timeout="60s" role="Slave" \
>>         op start timeout="240" \
>>         op stop timeout="240" \
>>         meta target-role="Stopped" migration-threshold="2"
>> ms ms_drbd_postfix r_drbd_postfix \
>>         meta master-max="1" master-node-max="1" \
>>         clone-max="2" clone-node-max="1" \
>>         notify="true" \
>>         meta target-role="Stopped"
>> commit
>> 
>> - Pacemaker is started from scratch.
>> - The config above is applied with crm -f <file>, where <file> contains
>>   the snippet above.
>> 
>> - After that, crm_mon shows the following status:
>> ----------------------8<-------------------------
>> Last updated: Mon Aug 26 18:42:47 2013
>> Last change: Mon Aug 26 18:42:42 2013 via cibadmin on dis03-test
>> Stack: cman
>> Current DC: dis03-test - partition with quorum
>> Version: 1.1.10-1.el6-9abe687
>> 2 Nodes configured
>> 4 Resources configured
>> 
>> Online: [ dis03-test dis04-test ]
>> 
>> Full list of resources:
>> 
>> r_stonith-dis03-test (stonith:fence_mock): Started dis04-test
>> r_stonith-dis04-test (stonith:fence_mock): Started dis03-test
>> Master/Slave Set: ms_drbd_postfix [r_drbd_postfix]
>>      Stopped: [ dis03-test dis04-test ]
>> 
>> Migration summary:
>> * Node dis04-test:
>> * Node dis03-test:
>> ----------------------8<-------------------------
>> 
>> cat /proc/drbd
>> version: 8.4.3 (api:1/proto:86-101)
>> GIT-hash: 89a294209144b68adb3ee85a73221f964d3ee515 build by
>> root@dis03-test, 2013-07-24 17:19:24
>> 
>> on both nodes. The drbd resource was shut down previously in a clean
>> state, so that either node can be the primary.
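
(A quick way to confirm that clean state on both nodes before handing the
resource back to pacemaker is to query drbd directly; a minimal sketch,
using the resource name and drbd.conf path from the config above:)

    # With the resource brought up manually on both nodes, but not promoted,
    # a cleanly synced pair should report UpToDate/UpToDate, Connected and
    # Secondary/Secondary:
    drbdadm -c /usr/local/etc/drbd.conf up postfix
    drbdadm -c /usr/local/etc/drbd.conf dstate postfix
    drbdadm -c /usr/local/etc/drbd.conf cstate postfix
    drbdadm -c /usr/local/etc/drbd.conf role postfix
    drbdadm -c /usr/local/etc/drbd.conf down postfix   # hand back to the cluster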
>> 
>> - Now the weird behaviour, when trying to start the drbd resource with
>>   crm resource start ms_drbd_postfix
>> 
>> Output of crm_mon -1rf:
>> ----------------------8<-------------------------
>> Last updated: Mon Aug 26 18:46:33 2013
>> Last change: Mon Aug 26 18:46:30 2013 via cibadmin on dis04-test
>> Stack: cman
>> Current DC: dis03-test - partition with quorum
>> Version: 1.1.10-1.el6-9abe687
>> 2 Nodes configured
>> 4 Resources configured
>> 
>> Online: [ dis03-test dis04-test ]
>> 
>> Full list of resources:
>> 
>> r_stonith-dis03-test (stonith:fence_mock): Started dis04-test
>> r_stonith-dis04-test (stonith:fence_mock): Started dis03-test
>> Master/Slave Set: ms_drbd_postfix [r_drbd_postfix]
>>      Slaves: [ dis03-test ]
>>      Stopped: [ dis04-test ]
>> 
>> Migration summary:
>> * Node dis04-test:
>>    r_drbd_postfix: migration-threshold=2 fail-count=2
>>    last-failure='Mon Aug 26 18:46:30 2013'
>> * Node dis03-test:
> 
> It's hard to imagine how pacemaker could cause drbdadm to fail, short of
> leaving one side promoted while trying to promote the other.
> Perhaps the drbd folks could comment on what the error means.
> 
>> Failed actions:
>>     r_drbd_postfix_promote_0 (node=dis04-test, call=34, rc=1,
>>     status=complete, last-rc-change=Mon Aug 26 18:46:29 2013,
>>     queued=1212ms, exec=0ms): unknown error
>> ----------------------8<-------------------------
>> 
>> In the log of the drbd agent I can find the following when the
>> promote request is handled on dis03-test:
>> 
>> ----------------------8<-------------------------
>> ++ drbdadm -c /usr/local/etc/drbd.conf primary postfix
>> 0: State change failed: (-2) Need access to UpToDate data
>> Command 'drbdsetup primary 0' terminated with exit code 17
>> + cmd_out=
>> + ret=17
>> + '[' 17 '!=' 0 ']'
>> + ocf_log err 'postfix: Called drbdadm -c /usr/local/etc/drbd.conf primary postfix'
>> + '[' 2 -lt 2 ']'
>> + __OCF_PRIO=err
>> + shift
>> ----------------------8<-------------------------
>> 
>> While this works without problems on pacemaker 1.1.8, it doesn't work
>> here. The error message leads me to assume that there is a kind of race
>> condition where pacemaker fires the promotion too early.
>> It probably has something to do with applying attributes from the drbd
>> resource agent, but this is just a guess and I really don't know.
>> 
>> One additional piece of information: as soon as I do a resource cleanup
>> on the "defective" node, the master is promoted as expected. That means
>> crm resource cleanup r_drbd_postfix dis03-test results in the following:
>> 
>> ----------------------8<-------------------------
>> Last updated: Mon Aug 26 19:29:38 2013
>> Last change: Mon Aug 26 19:29:28 2013 via cibadmin on dis04-test
>> Stack: cman
>> Current DC: dis03-test - partition with quorum
>> Version: 1.1.10-1.el6-9abe687
>> 2 Nodes configured
>> 4 Resources configured
>> 
>> Online: [ dis03-test dis04-test ]
>> 
>> Full list of resources:
>> 
>> r_stonith-dis03-test (stonith:fence_mock): Started dis04-test
>> r_stonith-dis04-test (stonith:fence_mock): Started dis03-test
>> Master/Slave Set: ms_drbd_postfix [r_drbd_postfix]
>>      Masters: [ dis03-test ]
>>      Slaves: [ dis04-test ]
>> 
>> Migration summary:
>> * Node dis03-test:
>> * Node dis04-test:
>> ----------------------8<-------------------------
>> 
>> I really hope I can get some attention, as pacemaker 1.1.10 is a
>> milestone for Andrew, and drbd from linbit is almost certainly a
>> building block of many pacemaker-based clusters.
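
(If the suspicion is a race between the agent's crm_master feedback and the
scheduler's promote decision, one way to watch for it is to query the
transient master-score attribute directly; a sketch, assuming the attribute
name follows the usual "master-<resource-id>" convention for this pacemaker
version:)

    # Query the master score the drbd agent set via crm_master; an empty
    # answer on the node being promoted would point at the suspected race:
    crm_attribute -N dis03-test -l reboot -n master-r_drbd_postfix -G -q
    crm_attribute -N dis04-test -l reboot -n master-r_drbd_postfix -G -q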
>> 
>> Cluster log of DC dis03-test at http://pastebin.com/2S9Y6V3P
>> DRBD agent log at http://pastebin.com/ceYNEAhH
>> 
>> So, any help is welcome.
>> 
>> Best regards
>> Andreas Mock
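
(For anyone trying to reproduce the scheduling decision, the saved
policy-engine inputs can be replayed offline; a sketch, with a hypothetical
file name, since the pe-input directory varies by build:)

    # Replay the transition that contained the failed promote and show the
    # allocation/promotion scores the policy engine computed:
    crm_simulate -S -s -x /var/lib/pacemaker/pengine/pe-input-123.bz2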
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org