On 07/11/2012 04:50 AM, Andrew Beekhof wrote:
> On Wed, Jul 11, 2012 at 8:06 AM, Andreas Kurz <andr...@hastexo.com> wrote:
>> On Tue, Jul 10, 2012 at 8:12 AM, Nikola Ciprich
>> <nikola.cipr...@linuxbox.cz> wrote:
>>> Hello Andreas,
>>>> Why not using the RA that comes with the resource-agent package?
>>> well, I've historically used my scripts, haven't even noticed when LVM
>>> resource appeared.. I switched to it now.., thanks for the hint..
>>>> this "become-primary-on" was never activated?
>>> nope.
>>>
>>>
>>>> Is the drbd init script deactivated on system boot? Cluster logs should
>>>> give more insights ....
>>> yes, it's deactivated. I tried resyncinc drbd by hand, deleted logs,
>>> rebooted both nodes, checked drbd ain't started and started corosync.
>>> result is here:
>>> http://nelide.cz/nik/logs.tar.gz
>>
>> It really really looks like Pacemaker is too fast when promoting to
>> primary ... before the connection to the already up second node can be
>> established.
>
> Do you mean we're violating a constraint?
> Or is it a problem of the RA returning too soon?

It looks like an RA problem ... the notifications after the start of the
resource and the promote that follows come very quickly, and DRBD has
still not finished establishing the connection to the peer. I can't
remember seeing this before.

Regards,
Andreas

>
>> I see in your logs you have DRBD 8.3.13 userland but
>> 8.3.11 DRBD module installed ... can you test with 8.3.13 kernel module
>> ... there have been fixes that look like addressing this problem.
>>
>> Another quick-fix, that should also do: add a start-delay of some
>> seconds to the start operation of DRBD
>>
>> ... or fix your after-split-brain policies to automatically solve this
>> special type of split-brain (with 0 blocks to sync).
>>
>> Best Regards,
>> Andreas
>>
>> --
>> Need help with Pacemaker?
>> http://www.hastexo.com/now
>>
>>>
>>> thanks for Your time.
>>> n.
>>>
>>>
>>>>
>>>> Regards,
>>>> Andreas
>>>>
>>>> --
>>>> Need help with Pacemaker?
>>>> http://www.hastexo.com/now
>>>>
>>>>>
>>>>> thanks a lot in advance
>>>>>
>>>>> nik
>>>>>
>>>>>
>>>>> On Sun, Jul 08, 2012 at 12:47:16AM +0200, Andreas Kurz wrote:
>>>>>> On 07/02/2012 11:49 PM, Nikola Ciprich wrote:
>>>>>>> hello,
>>>>>>>
>>>>>>> I'm trying to solve quite mysterious problem here..
>>>>>>> I've got new cluster with bunch of SAS disks for testing purposes.
>>>>>>> I've configured DRBDs (in primary/primary configuration)
>>>>>>>
>>>>>>> when I start drbd using drbdadm, it get's up nicely (both nodes
>>>>>>> are Primary, connected).
>>>>>>> however when I start it using corosync, I always get split-brain,
>>>>>>> although
>>>>>>> there are no data written, no network disconnection, anything..
>>>>>>
>>>>>> your full drbd and Pacemaker configuration please ... some snippets from
>>>>>> something are very seldom helpful ...
>>>>>>
>>>>>> Regards,
>>>>>> Andreas
>>>>>>
>>>>>> --
>>>>>> Need help with Pacemaker?
>>>>>> http://www.hastexo.com/now
>>>>>>
>>>>>>>
>>>>>>> here's drbd resource config:
>>>>>>> primitive drbd-sas0 ocf:linbit:drbd \
>>>>>>>     params drbd_resource="drbd-sas0" \
>>>>>>>     operations $id="drbd-sas0-operations" \
>>>>>>>     op start interval="0" timeout="240s" \
>>>>>>>     op stop interval="0" timeout="200s" \
>>>>>>>     op promote interval="0" timeout="200s" \
>>>>>>>     op demote interval="0" timeout="200s" \
>>>>>>>     op monitor interval="179s" role="Master" timeout="150s" \
>>>>>>>     op monitor interval="180s" role="Slave" timeout="150s"
>>>>>>>
>>>>>>> ms ms-drbd-sas0 drbd-sas0 \
>>>>>>>     meta clone-max="2" clone-node-max="1" master-max="2" master-node-max="1" notify="true" globally-unique="false" interleave="true" target-role="Started"
>>>>>>>
>>>>>>>
>>>>>>> here's the dmesg output when pacemaker tries to promote drbd, causing
>>>>>>> the splitbrain:
>>>>>>> [ 157.646292] block drbd2: Starting worker thread (from drbdsetup [6892])
>>>>>>> [ 157.646539] block drbd2: disk( Diskless -> Attaching )
>>>>>>> [ 157.650364] block drbd2: Found 1 transactions (1 active extents) in activity log.
>>>>>>> [ 157.650560] block drbd2: Method to ensure write ordering: drain
>>>>>>> [ 157.650688] block drbd2: drbd_bm_resize called with capacity == 584667688
>>>>>>> [ 157.653442] block drbd2: resync bitmap: bits=73083461 words=1141930 pages=2231
>>>>>>> [ 157.653760] block drbd2: size = 279 GB (292333844 KB)
>>>>>>> [ 157.671626] block drbd2: bitmap READ of 2231 pages took 18 jiffies
>>>>>>> [ 157.673722] block drbd2: recounting of set bits took additional 2 jiffies
>>>>>>> [ 157.673846] block drbd2: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
>>>>>>> [ 157.673972] block drbd2: disk( Attaching -> UpToDate )
>>>>>>> [ 157.674100] block drbd2: attached to UUIDs 0150944D23F16BAE:0000000000000000:8C175205284E3262:8C165205284E3263
>>>>>>> [ 157.685539] block drbd2: conn( StandAlone -> Unconnected )
>>>>>>> [ 157.685704] block drbd2: Starting receiver thread (from drbd2_worker [6893])
>>>>>>> [ 157.685928] block drbd2: receiver (re)started
>>>>>>> [ 157.686071] block drbd2: conn( Unconnected -> WFConnection )
>>>>>>> [ 158.960577] block drbd2: role( Secondary -> Primary )
>>>>>>> [ 158.960815] block drbd2: new current UUID 015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263
>>>>>>> [ 162.686990] block drbd2: Handshake successful: Agreed network protocol version 96
>>>>>>> [ 162.687183] block drbd2: conn( WFConnection -> WFReportParams )
>>>>>>> [ 162.687404] block drbd2: Starting asender thread (from drbd2_receiver [6927])
>>>>>>> [ 162.687741] block drbd2: data-integrity-alg: <not-used>
>>>>>>> [ 162.687930] block drbd2: drbd_sync_handshake:
>>>>>>> [ 162.688057] block drbd2: self 015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263 bits:0 flags:0
>>>>>>> [ 162.688244] block drbd2: peer 7EC38CBFC3D28FFF:0150944D23F16BAF:8C175205284E3263:8C165205284E3263 bits:0 flags:0
>>>>>>> [ 162.688428] block drbd2: uuid_compare()=100 by rule 90
>>>>>>> [ 162.688544] block drbd2: helper command: /sbin/drbdadm initial-split-brain minor-2
>>>>>>> [ 162.691332] block drbd2: helper command: /sbin/drbdadm initial-split-brain minor-2 exit code 0 (0x0)
>>>>>>>
>>>>>>> to me it seems to be that it's promoting it too early, and I also
>>>>>>> wonder why there is the
>>>>>>> "new current UUID" stuff?
>>>>>>>
>>>>>>> I'm using centos6, kernel 3.0.36, drbd-8.3.13, pacemaker-1.1.6
>>>>>>>
>>>>>>> could anybody please try to advice me? I'm sure I'm doing something
>>>>>>> stupid, but can't figure out what...
>>>>>>>
>>>>>>> thanks a lot in advance
>>>>>>>
>>>>>>> with best regards
>>>>>>>
>>>>>>> nik
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>>
>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>> Bugs: http://bugs.clusterlabs.org
>>>
>>> --
>>> -------------------------------------
>>> Ing. Nikola CIPRICH
>>> LinuxBox.cz, s.r.o.
>>> 28.rijna 168, 709 00 Ostrava
>>>
>>> tel.: +420 591 166 214
>>> fax: +420 596 621 273
>>> mobil: +420 777 093 799
>>> www.linuxbox.cz
>>>
>>> mobil servis: +420 737 238 656
>>> email servis: ser...@linuxbox.cz
>>> -------------------------------------
>>>
>>
>
--
Need help with Pacemaker?
http://www.hastexo.com/now
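
PS: for reference, the start-delay quick fix mentioned in the quoted text
could look roughly like the sketch below, applied to the primitive from
the original post. The "10s" value is only an assumed placeholder for
"some seconds" and would need tuning; everything else is copied unchanged
from the posted configuration. This only works around the timing issue,
it does not fix the RA.

```
# Sketch only: start-delay="10s" is an assumed value, not a tested one.
primitive drbd-sas0 ocf:linbit:drbd \
    params drbd_resource="drbd-sas0" \
    operations $id="drbd-sas0-operations" \
    op start interval="0" timeout="240s" start-delay="10s" \
    op stop interval="0" timeout="200s" \
    op promote interval="0" timeout="200s" \
    op demote interval="0" timeout="200s" \
    op monitor interval="179s" role="Master" timeout="150s" \
    op monitor interval="180s" role="Slave" timeout="150s"
```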