On 06/09/2013, at 5:51 PM, Lars Ellenberg <lars.ellenb...@linbit.com> wrote:
> On Tue, Aug 27, 2013 at 06:51:45AM +0200, Andreas Mock wrote:
>> Hi Andrew,
>>
>> as this is a real showstopper at the moment, I invested some more
>> hours to be sure (as far as possible) of not having made an error.
>>
>> Some additions:
>> 1) I mirrored the whole mini drbd config to another pacemaker cluster.
>>    Same result: pacemaker 1.1.8 works, pacemaker 1.1.10 does not.
>> 2) When I remove the target role Stopped from the drbd ms resource
>>    and insert the config snippet related to the drbd device via crm -f <file>
>>    into a lean running pacemaker config (pacemaker cluster options, stonith
>>    resources), it seems to work. That means one of the nodes gets promoted.
>>
>> Then, after stopping with 'crm resource stop ms_drbd_xxx' and starting again,
>> I see the same promotion error as described.
>>
>> The drbd resource agent is using /usr/sbin/crm_master.
>> Is there a possibility that feedback given through this client tool
>> is changing the timing behaviour of pacemaker? Or the way
>> transitions are scheduled?
>> Any idea that may be related to a change in pacemaker?
>
> I think that recent pacemaker allows for "start" and "promote" in the
> same transition.

At least in the one case I saw logs of, this wasn't the case. The PE computed:

Current cluster status:
Online: [ db05 db06 ]

 r_stonith-db05  (stonith:fence_imm):  Started db06
 r_stonith-db06  (stonith:fence_imm):  Started db05
 Master/Slave Set: ms_drbd_fodb [r_drbd_fodb]
     Slaves: [ db05 db06 ]
 Master/Slave Set: ms_drbd_fodblog [r_drbd_fodblog]
     Slaves: [ db05 db06 ]

Transition Summary:
 * Promote r_drbd_fodb:0     (Slave -> Master db05)
 * Promote r_drbd_fodblog:0  (Slave -> Master db05)

and it was the promotion of r_drbd_fodb:0 that failed.

> Or at least has promote follow start much faster than before.
>
> To be able to really tell why DRBD has problems to promote,
> I'd need the drbd (kernel!) logs; the agent logs are not good enough.
>
> Maybe you hit some interesting race between DRBD establishing the
> connection and pacemaker trying to promote one of the nodes.
> Correlating DRBD kernel logs with pacemaker logs should tell.
>
> Do you have DRBD resource-level fencing enabled?
>
> I guess right now you can reproduce it easily by:
>   crm resource stop ms_drbd
>   crm resource start ms_drbd
>
> I suspect you would not be able to reproduce it by:
>   crm resource stop ms_drbd
>   crm resource demote ms_drbd    (will only make drbd Secondary)
>   ... meanwhile, DRBD will establish the connection ...
>   crm resource promote ms_drbd   (will then promote one node)
>
> Hth,
>
> Lars
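(For reference, resource-level fencing as Lars asks about is typically wired up in
drbd.conf roughly as follows. This is only a sketch following the DRBD 8.4 user's
guide; the handler paths assume a packaged install, so a source build under
/usr/local, as used in this thread, will likely have them elsewhere.)

  resource postfix {
    disk {
      # refuse to promote while the peer's data might be newer
      fencing resource-only;
    }
    handlers {
      # add a location constraint to the CIB while the peer is unreachable,
      # and remove it again once resync has finished
      fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
      after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
  }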
>> -----Original Message-----
>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>> Sent: Tuesday, August 27, 2013 05:02
>> To: General Linux-HA mailing list
>> Cc: pacemaker@oss.clusterlabs.org
>> Subject: Re: [Pacemaker] [Linux-HA] Probably a regression of the linbit drbd
>> agent between pacemaker 1.1.8 and 1.1.10
>>
>> On 27/08/2013, at 3:31 AM, Andreas Mock <andreas.m...@web.de> wrote:
>>
>>> Hi all,
>>>
>>> while the linbit drbd resource agent seems to work perfectly on
>>> pacemaker 1.1.8 (standard software repository), we have problems with
>>> the latest release 1.1.10 and also with the newest head 1.1.11.xxx.
>>>
>>> As using drbd is not so uncommon, I really hope to find interested
>>> people to help me out. I can provide as much debug information as you
>>> want.
>>>
>>> Environment:
>>> - RHEL 6.4 clone (Scientific Linux 6.4), cman-based cluster, 64 bit
>>> - DRBD 8.4.3 compiled from sources
>>> - A drbd resource configured following the linbit documentation
>>> - Manual start and stop (up/down) and setting primary of the drbd resource
>>>   work smoothly
>>> - 2 nodes dis03-test/dis04-test
>>>
>>> - The following simple config on pacemaker 1.1.8:
>>>
>>> configure
>>> property no-quorum-policy=stop
>>> property stonith-enabled=true
>>> rsc_defaults resource-stickiness=2
>>> primitive r_stonith-dis03-test stonith:fence_mock \
>>>     meta resource-stickiness="INFINITY" target-role="Started" \
>>>     op monitor interval="180" timeout="300" requires="nothing" \
>>>     op start interval="0" timeout="300" \
>>>     op stop interval="0" timeout="300" \
>>>     params vmname=dis03-test pcmk_host_list="dis03-test"
>>> primitive r_stonith-dis04-test stonith:fence_mock \
>>>     meta resource-stickiness="INFINITY" target-role="Started" \
>>>     op monitor interval="180" timeout="300" requires="nothing" \
>>>     op start interval="0" timeout="300" \
>>>     op stop interval="0" timeout="300" \
>>>     params vmname=dis04-test pcmk_host_list="dis04-test"
>>> location r_stonith-dis03_hates_dis03 r_stonith-dis03-test \
>>>     rule $id="r_stonith-dis03_hates_dis03-test_rule" -inf: #uname eq dis03-test
>>> location r_stonith-dis04_hates_dis04 r_stonith-dis04-test \
>>>     rule $id="r_stonith-dis04_hates_dis04-test_rule" -inf: #uname eq dis04-test
>>> primitive r_drbd_postfix ocf:linbit:drbd \
>>>     params drbd_resource="postfix" drbdconf="/usr/local/etc/drbd.conf" \
>>>     op monitor interval="15s" timeout="60s" role="Master" \
>>>     op monitor interval="45s" timeout="60s" role="Slave" \
>>>     op start timeout="240" \
>>>     op stop timeout="240" \
>>>     meta target-role="Stopped" migration-threshold="2"
>>> ms ms_drbd_postfix r_drbd_postfix \
>>>     meta master-max="1" master-node-max="1" \
>>>          clone-max="2" clone-node-max="1" \
>>>          notify="true" \
>>>     meta target-role="Stopped"
>>> commit
>>>
>>> - Pacemaker is started from scratch.
>>> - The config above is applied with crm -f <file>, where <file> contains the
>>>   above config snippet.
>>>
>>> - After that, crm_mon shows the following status:
>>> ----------------------8<-------------------------
>>> Last updated: Mon Aug 26 18:42:47 2013
>>> Last change: Mon Aug 26 18:42:42 2013 via cibadmin on dis03-test
>>> Stack: cman
>>> Current DC: dis03-test - partition with quorum
>>> Version: 1.1.10-1.el6-9abe687
>>> 2 Nodes configured
>>> 4 Resources configured
>>>
>>> Online: [ dis03-test dis04-test ]
>>>
>>> Full list of resources:
>>>
>>> r_stonith-dis03-test (stonith:fence_mock): Started dis04-test
>>> r_stonith-dis04-test (stonith:fence_mock): Started dis03-test
>>> Master/Slave Set: ms_drbd_postfix [r_drbd_postfix]
>>>     Stopped: [ dis03-test dis04-test ]
>>>
>>> Migration summary:
>>> * Node dis04-test:
>>> * Node dis03-test:
>>> ----------------------8<-------------------------
>>>
>>> cat /proc/drbd
>>> version: 8.4.3 (api:1/proto:86-101)
>>> GIT-hash: 89a294209144b68adb3ee85a73221f964d3ee515 build by root@dis03-test,
>>> 2013-07-24 17:19:24
>>>
>>> on both nodes. The drbd resource was shut down previously in a clean
>>> state, so that any node can be the primary.
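(As an aside, the "clean state" assumption above is easy to check from the shell
before handing the resource to pacemaker; a minimal sketch, assuming the resource
is called postfix and drbd.conf lives in /usr/local/etc as configured above:)

  drbdadm -c /usr/local/etc/drbd.conf role   postfix   # expect Secondary/Secondary
  drbdadm -c /usr/local/etc/drbd.conf cstate postfix   # expect Connected
  drbdadm -c /usr/local/etc/drbd.conf dstate postfix   # expect UpToDate/UpToDate
  cat /proc/drbd                                       # same information, all resources

(Right after "drbdadm up", before the peers have connected, the local disk of a
cleanly shut down resource is typically only "Consistent" rather than "UpToDate";
promoting at that moment fails in exactly the way shown further down.)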
>>>
>>> - Now the weird behaviour when trying to start the drbd with
>>>   crm resource start ms_drbd_postfix
>>>
>>> Output of crm_mon -1rf:
>>> ----------------------8<-------------------------
>>> Last updated: Mon Aug 26 18:46:33 2013
>>> Last change: Mon Aug 26 18:46:30 2013 via cibadmin on dis04-test
>>> Stack: cman
>>> Current DC: dis03-test - partition with quorum
>>> Version: 1.1.10-1.el6-9abe687
>>> 2 Nodes configured
>>> 4 Resources configured
>>>
>>> Online: [ dis03-test dis04-test ]
>>>
>>> Full list of resources:
>>>
>>> r_stonith-dis03-test (stonith:fence_mock): Started dis04-test
>>> r_stonith-dis04-test (stonith:fence_mock): Started dis03-test
>>> Master/Slave Set: ms_drbd_postfix [r_drbd_postfix]
>>>     Slaves: [ dis03-test ]
>>>     Stopped: [ dis04-test ]
>>>
>>> Migration summary:
>>> * Node dis04-test:
>>>    r_drbd_postfix: migration-threshold=2 fail-count=2
>>>                    last-failure='Mon Aug 26 18:46:30 2013'
>>> * Node dis03-test:
>>
>> It's hard to imagine how pacemaker could cause drbdadm to fail, short of
>> leaving the other side promoted while trying to promote another.
>> Perhaps the drbd folks could comment on what the error means.
>>
>>> Failed actions:
>>>     r_drbd_postfix_promote_0 (node=dis04-test, call=34, rc=1,
>>>     status=complete, last-rc-change=Mon Aug 26 18:46:29 2013,
>>>     queued=1212ms, exec=0ms): unknown error
>>> ----------------------8<-------------------------
>>>
>>> In the log of the drbd agent I can find the following when the
>>> promote request is handled on dis03-test:
>>>
>>> ----------------------8<-------------------------
>>> ++ drbdadm -c /usr/local/etc/drbd.conf primary postfix
>>> 0: State change failed: (-2) Need access to UpToDate data
>>> Command 'drbdsetup primary 0' terminated with exit code 17
>>> + cmd_out=
>>> + ret=17
>>> + '[' 17 '!=' 0 ']'
>>> + ocf_log err 'postfix: Called drbdadm -c /usr/local/etc/drbd.conf primary postfix'
>>> + '[' 2 -lt 2 ']'
>>> + __OCF_PRIO=err
>>> + shift
>>> ----------------------8<-------------------------
>>>
>>> While this works without problems on pacemaker 1.1.8, it doesn't work here.
>>> The error message leads me to assume that there is a kind of race condition
>>> where pacemaker is firing the promotion too early.
>>> Probably it has something to do with applying attributes from the drbd
>>> resource agent.
>>> But this is just a guess and I really don't know.
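(Regarding the guess about attributes: the /usr/sbin/crm_master calls mentioned
earlier in the thread boil down to setting a per-node "master score" attribute
that the policy engine consults when deciding whom to promote. A rough sketch of
the mechanism, not the linbit agent's actual code; the attribute name below
follows the usual master-<resource> convention, may carry an instance suffix
such as :0, and the score value is made up:)

  # inside a resource agent (OCF_RESOURCE_INSTANCE identifies the clone instance):
  crm_master -l reboot -v 10000    # "my data is good, prefer me as Master"
  crm_master -l reboot -D          # "do not promote me here"

  # the result can be inspected as a transient node attribute, e.g.:
  crm_mon -1 -A
  crm_attribute -N dis04-test -n master-r_drbd_postfix -l reboot -G

(Whether and when such a score was set should be visible in the CIB and attrd
logs around the failed promote.)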
>>> ONE ADDITIONAL piece of information: as soon as I do a resource cleanup on the
>>> "defective" node, the master is promoted as expected. That means a
>>>   crm resource cleanup r_drbd_postfix dis03-test
>>> results in the following:
>>>
>>> ----------------------8<-------------------------
>>> Last updated: Mon Aug 26 19:29:38 2013
>>> Last change: Mon Aug 26 19:29:28 2013 via cibadmin on dis04-test
>>> Stack: cman
>>> Current DC: dis03-test - partition with quorum
>>> Version: 1.1.10-1.el6-9abe687
>>> 2 Nodes configured
>>> 4 Resources configured
>>>
>>> Online: [ dis03-test dis04-test ]
>>>
>>> Full list of resources:
>>>
>>> r_stonith-dis03-test (stonith:fence_mock): Started dis04-test
>>> r_stonith-dis04-test (stonith:fence_mock): Started dis03-test
>>> Master/Slave Set: ms_drbd_postfix [r_drbd_postfix]
>>>     Masters: [ dis03-test ]
>>>     Slaves: [ dis04-test ]
>>>
>>> Migration summary:
>>> * Node dis03-test:
>>> * Node dis04-test:
>>> ----------------------8<-------------------------
>>>
>>> I really hope I can get some attention, as pacemaker 1.1.10 is a
>>> milestone for Andrew, and drbd from linbit is surely a building
>>> block of many pacemaker-based clusters.
>>>
>>> Cluster log of DC dis03-test at http://pastebin.com/2S9Y6V3P
>>> DRBD agent log at http://pastebin.com/ceYNEAhH
>>>
>>> So, any help is welcome.
>>>
>>> Best regards
>>> Andreas Mock
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
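(Finally, to follow Lars's suggestion above of correlating DRBD kernel messages
with what pacemaker was doing at promote time, something like the following is
usually enough. The log locations are assumptions for a RHEL 6 / cman setup and
may differ locally:)

  # DRBD state transitions ("conn( ... )", "disk( ... )", "role( ... )") are
  # logged by the kernel:
  grep -E 'drbd postfix|block drbd0' /var/log/messages

  # what crmd/pengine/lrmd did around the failed promote:
  grep -E 'r_drbd_postfix|promote' /var/log/cluster/corosync.log

  # compare timestamps: did the promote run before the connection reached
  # Connected / the disk reached UpToDate?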