For those who are interested, or those who might stumble on this thread through Google:
After fiddling around a bit longer without any reliable results, I've
decided to install DRBD8 instead of DRBD9, as Yannis suggested earlier.
Worked straight away, and after a bit of tuning it works awesome.

Kind regards,

D.

On 24-05-18 07:54, Dirk Bonenkamp - ProActive wrote:
> Hello,
>
> Thank you for your suggestion. The MTU is 1500 on both nodes. I had it
> at 9000, but reverted everything to 'normal' to debug this problem.
> Pinging as in your example works fine.
>
> Cheers,
>
> Dirk
>
> On 23-05-18 21:22, Nelson Hicks wrote:
>> Is there any chance this could be an MTU mismatch between the two
>> nodes? If you use ping with varying packet sizes from one node to the
>> other, do they stop working above a specific size? Does ifconfig
>> report the same MTU size for the interface on both nodes?
>>
>> Examples:
>>
>> ifconfig | grep MTU
>>
>> ping -s 500 <other_ip>
>>
>> ping -s 1400 <other_ip>
>>
>> ping -s 1472 <other_ip>
>>
>> ping -s 2000 <other_ip>
>>
>> Thanks,
>>
>> - Nelson Hicks
>>
>> On 05/23/2018 02:07 PM, Dirk Bonenkamp - ProActive wrote:
>>> Hi,
>>>
>>> Thank you for your reply.
>>>
>>> I am / was under the impression that DRBD9 is the new and improved
>>> DRBD, so I figured to use this version. But this is not the case?
>>> Could somebody enlighten me a bit?
>>>
>>> I already have disabled all bonding and other fancy network stuff,
>>> so I'm using 1 nic currently. This doesn't solve anything
>>> unfortunately.
>>>
>>> Kind regards,
>>>
>>> Dirk
>>>
>>> On 23-05-18 14:20, Yannis Milios wrote:
>>>> Two things:
>>>>
>>>> - I would use drbd8 instead of drbd9 for a 2 node setup.
>>>> - I would first test with 1 nic instead of 2.
>>>>
>>>> On Wed, May 23, 2018 at 11:01 AM, Dirk Bonenkamp - ProActive
>>>> <[email protected]> wrote:
>>>>
>>>> Hi List,
>>>>
>>>> I'm struggling with a new DRBD9 setup. It's a simple Master/Slave
>>>> setup. I'm running Ubuntu 16.04 LTS with the DRBD9 packages from the
>>>> Launchpad PPA.
>>>>
>>>> I'm running some DRBD8 systems in production for quite some years,
>>>> so I have some experience. This setup is very similar, the only
>>>> major difference is that this is DRBD9 and I use LUKS encrypted
>>>> partitions as backend.
>>>>
>>>> I keep running into this 'PingAck did not arrive in time.' error,
>>>> which points to network issues if I am correct (see complete log
>>>> snippet below). This error occurs when I try to reattach the
>>>> secondary node after a reboot. Initial sync works fine.
>>>>
>>>> The servers are interconnected with 2 10Gb NICs. I had bonding &
>>>> jumbo frames configured, but deactivated all this, to no avail.
>>>> I've also stripped the DRBD configuration to the bare minimum (see
>>>> below).
>>>>
>>>> I've tested the connection with iperf and some other tools and it
>>>> seems just fine.
>>>>
>>>> Could somebody point me in the right direction?
>>>>
>>>> Thank you in advance, regards,
>>>>
>>>> Dirk Bonenkamp
>>>>
>>>> syslog messages:
>>>>
>>>> May 23 11:31:56 data2 kernel: [ 704.111755] drbd: loading out-of-tree module taints kernel.
>>>> May 23 11:31:56 data2 kernel: [ 704.112290] drbd: module verification failed: signature and/or required key missing - tainting kernel
>>>> May 23 11:31:56 data2 kernel: [ 704.127677] drbd: initialized. Version: 9.0.14-1 (api:2/proto:86-113)
>>>> May 23 11:31:56 data2 kernel: [ 704.127680] drbd: GIT-hash: 62f906cf44ef02a30ce0c148fec223b40c51c533 build by root@data2, 2018-05-23 09:19:54
>>>> May 23 11:31:56 data2 kernel: [ 704.127683] drbd: registered as block device major 147
>>>> May 23 11:31:56 data2 kernel: [ 704.153565] drbd r0: Starting worker thread (from drbdsetup [4495])
>>>> May 23 11:31:56 data2 kernel: [ 704.183031] drbd r0/0 drbd0: disk( Diskless -> Attaching )
>>>> May 23 11:31:56 data2 kernel: [ 704.183066] drbd r0/0 drbd0: Maximum number of peer devices = 1
>>>> May 23 11:31:56 data2 kernel: [ 704.183293] drbd r0: Method to ensure write ordering: flush
>>>> May 23 11:31:56 data2 kernel: [ 704.183308] drbd r0/0 drbd0: drbd_bm_resize called with capacity == 273437203064
>>>> May 23 11:31:58 data2 kernel: [ 706.508228] drbd r0/0 drbd0: resync bitmap: bits=34179650383 words=534057038 pages=1043081
>>>> May 23 11:31:58 data2 kernel: [ 706.508234] drbd r0/0 drbd0: size = 127 TB (136718601532 KB)
>>>> May 23 11:31:58 data2 kernel: [ 706.508236] drbd r0/0 drbd0: size = 127 TB (136718601532 KB)
>>>> May 23 11:32:10 data2 kernel: [ 717.890420] drbd r0/0 drbd0: recounting of set bits took additional 1256ms
>>>> May 23 11:32:10 data2 kernel: [ 717.890435] drbd r0/0 drbd0: disk( Attaching -> Outdated )
>>>> May 23 11:32:10 data2 kernel: [ 717.890439] drbd r0/0 drbd0: attached to current UUID: 244DD61D2781DF44
>>>> May 23 11:32:10 data2 kernel: [ 717.918473] drbd r0 data1: Starting sender thread (from drbdsetup [4544])
>>>> May 23 11:32:10 data2 kernel: [ 717.922534] drbd r0 data1: conn( StandAlone -> Unconnected )
>>>> May 23 11:32:10 data2 kernel: [ 717.922820] drbd r0 data1: Starting receiver thread (from drbd_w_r0 [4498])
>>>> May 23 11:32:10 data2 kernel: [ 717.922973] drbd r0 data1: conn( Unconnected -> Connecting )
>>>> May 23 11:32:10 data2 kernel: [ 718.421219] drbd r0 data1: Handshake to peer 1 successful: Agreed network protocol version 113
>>>> May 23 11:32:10 data2 kernel: [ 718.421229] drbd r0 data1: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
>>>> May 23 11:32:10 data2 kernel: [ 718.421259] drbd r0 data1: Starting ack_recv thread (from drbd_r_r0 [4550])
>>>> May 23 11:32:10 data2 kernel: [ 718.424095] drbd r0: Preparing cluster-wide state change 1205605755 (0->1 499/146)
>>>> May 23 11:32:10 data2 kernel: [ 718.437172] drbd r0: State change 1205605755: primary_nodes=2, weak_nodes=FFFFFFFFFFFFFFFC
>>>> May 23 11:32:10 data2 kernel: [ 718.437185] drbd r0: Aborting cluster-wide state change 1205605755 (12ms) rv = -22
>>>> May 23 11:32:12 data2 kernel: [ 719.896223] drbd r0: Preparing cluster-wide state change 445952355 (0->1 499/146)
>>>> May 23 11:32:12 data2 kernel: [ 719.896498] drbd r0: State change 445952355: primary_nodes=2, weak_nodes=FFFFFFFFFFFFFFFC
>>>> May 23 11:32:12 data2 kernel: [ 719.896508] drbd r0: Committing cluster-wide state change 445952355 (0ms)
>>>> May 23 11:32:12 data2 kernel: [ 719.896541] drbd r0 data1: conn( Connecting -> Connected ) peer( Unknown -> Primary )
>>>> May 23 11:32:12 data2 kernel: [ 719.912186] drbd r0/0 drbd0 data1: drbd_sync_handshake:
>>>> May 23 11:32:12 data2 kernel: [ 719.912198] drbd r0/0 drbd0 data1: self 244DD61D2781DF44:0000000000000000:0000000000000000:0000000000000000 bits:52035 flags:20
>>>> May 23 11:32:12 data2 kernel: [ 719.912207] drbd r0/0 drbd0 data1: peer E38BE51FE782EAE0:244DD61D2781DF44:934CAB8662DF0410:E555BDC58E528356 bits:53162 flags:20
>>>> May 23 11:32:12 data2 kernel: [ 719.912214] drbd r0/0 drbd0 data1: uuid_compare()=-2 by rule 50
>>>> May 23 11:32:12 data2 kernel: [ 719.912248] drbd r0/0 drbd0 data1: pdsk( DUnknown -> UpToDate ) repl( Off -> WFBitMapT )
>>>> May 23 11:32:32 data2 kernel: [ 740.397026] drbd r0 data1: PingAck did not arrive in time.
>>>> May 23 11:32:32 data2 kernel: [ 740.397121] drbd r0 data1: conn( Connected -> NetworkFailure ) peer( Primary -> Unknown )
>>>> May 23 11:32:32 data2 kernel: [ 740.397131] drbd r0/0 drbd0 data1: pdsk( UpToDate -> DUnknown ) repl( WFBitMapT -> Off )
>>>> May 23 11:32:32 data2 kernel: [ 740.397176] drbd r0 data1: ack_receiver terminated
>>>> May 23 11:32:32 data2 kernel: [ 740.397182] drbd r0 data1: Terminating ack_recv thread
>>>> May 23 11:32:32 data2 kernel: [ 740.458608] drbd r0 data1: Connection closed
>>>> May 23 11:32:32 data2 kernel: [ 740.458650] drbd r0 data1: conn( NetworkFailure -> Unconnected )
>>>> May 23 11:32:32 data2 kernel: [ 740.458688] drbd r0 data1: Restarting receiver thread
>>>> May 23 11:32:32 data2 kernel: [ 740.458723] drbd r0 data1: conn( Unconnected -> Connecting )
>>>>
>>>> resources:
>>>>
>>>> resource r0 {
>>>>     on data1 {
>>>>         device /dev/drbd0;
>>>>         disk /dev/mapper/mapper_secure;
>>>>         address 172.16.11.21:7789;
>>>>         meta-disk internal;
>>>>     }
>>>>     on data2 {
>>>>         device /dev/drbd0;
>>>>         disk /dev/mapper/mapper_secure;
>>>>         address 172.16.11.22:7789;
>>>>         meta-disk internal;
>>>>     }
>>>> }
>>>>
>>>> drbd configuration:
>>>>
>>>> global {
>>>>     usage-count yes;
>>>> }
>>>>
>>>> common {
>>>>     #handlers {
>>>>     #    fence-peer "/usr/lib/drbd/crm-fence-peer.9.sh";
>>>>     #    after-resync-target "/usr/lib/drbd/crm-unfence-peer.9.sh";
>>>>     #}
>>>>     #disk {
>>>>     #    on-io-error detach;
>>>>     #    disk-barrier no;
>>>>     #    disk-flushes no;
>>>>     #    al-extents 3833;
>>>>     #    c-plan-ahead 7;
>>>>     #    c-fill-target 2M;
>>>>     #    c-min-rate 80M;
>>>>     #    c-max-rate 720M;
>>>>     #}
>>>>     net {
>>>>         protocol C;
>>>>         #fencing resource-only;
>>>>         #cram-hmac-alg sha1;
>>>>         #verify-alg sha1;
>>>>         #shared-secret 1e69dc721fd2e65368ae3ba1e5929979;
>>>>         #after-sb-0pri disconnect;
>>>>         #after-sb-1pri disconnect;
>>>>         #after-sb-2pri disconnect;
>>>>         #max-buffers 8000;
>>>>         #max-epoch-size 8000;
>>>>         #sndbuf-size 0;
>>>>         #rcvbuf-size 2048k;
>>>>     }
>>>> }

--
ProActive Software
Dirk Bonenkamp
CTO <https://www.proactive-software.com>
Phone: +31 (0)23 54 222 99
Mobile: +31 (0)6 250 787 93
Richard Holkade 9
2033 PZ Haarlem
LinkedIn <http://linkd.in/1V6egnk> Facebook <http://bit.ly/FBProActive> YouTube <http://bit.ly/1Mc23L9>
www.proactive.nl <https://www.proactive.nl>
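
A note on the ping sizes in Nelson's examples above, for anyone repeating the MTU check: 1472 is the largest ICMP payload that fits in a single 1500-byte packet, since an IPv4 header without options takes 20 bytes and the ICMP header another 8. The sketch below only does that arithmetic and prints the matching ping commands; it does not run them. The function names are made up for illustration, the peer address is the data2 address from the resource config in the thread, and the -M do flag (set the don't-fragment bit) assumes the Linux iputils ping. Without it, an oversized ping may simply be fragmented and still succeed, which hides an MTU problem.

```shell
#!/bin/sh
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes IPv4 header (assuming no IP options) minus 8 bytes
# ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Print (do not run) two ping commands for a given peer and MTU:
# the first should succeed on a clean path, the second should fail
# with a "message too long" style error because it is one byte too big.
# Assumes Linux iputils ping, where -M do forbids fragmentation.
mtu_probe_commands() {
    peer=$1
    mtu=$2
    fit=$(icmp_payload_for_mtu "$mtu")
    echo "ping -M do -c 3 -s $fit $peer"
    echo "ping -M do -c 3 -s $((fit + 1)) $peer"
}

# Example: the replication link from this thread at the standard MTU.
mtu_probe_commands 172.16.11.22 1500
```

For a jumbo-frame setup the same call with 9000 instead of 1500 gives a payload size of 8972. If the first printed command fails on either node, the path MTU is smaller than the interface MTU suggests.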
_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user
