Hi Reuben, thank you for digging into this issue reported by some users, now.
I wonder if there are users out there using master/master replication with 2.3.1 dovecot that do *NOT* observe the errors mentioned in this thread? Please report, please. And, pleas have a closer look for It would be interesting to know if these failures are seen by a subset of users, only. And please have a look for processes like: doveadm-server: [IP4 <user> INBOX import:1/3] (doveadm-server) These processes will "survive" a dovecot reboot ... With kind regards, Michael > On 6. May 2018, at 16:11, Reuben Farrelly <reuben-dove...@reub.net> wrote: > > Ah yes. The hash I put below is wrong but the commit message I quoted was > right. This is the last known working good commit, which was on Mar 6: > > https://github.com/dovecot/core/commit/ff5d02182573b1d4306c2a8c36605c98f217ef3b > "doveadm: Unref header search context after use > Fixes memory leak, found by valgrind" > > I've subsequently been able to determine that the bad commit was one of the > dozens which were committed in a big batch on Mar 12, 2018. I definitely see > the problem happening when running with commit > 6418419ec282c887b67469dbe3f541fc4873f7f0 (last one in that batch). > > In other words it's one of the commits between those two hashes that is the > problem. Just need to work out which one........... > > The problem is also seen with -git as of a day or so ago. I'm tracking > master-2.3 so I imagine I'm picking up all the 2.3 fixes as they go straight > in (including the rc fixes). So this doesn't appear to be fixed yet. > > Reuben > > > > On 6/05/2018 11:41 pm, Aki Tuomi wrote: >> Seems you got the hash wrong... >> You probably mean 4a1e2e8e85ad5d4ab43496a5f3cd6b7ff6ac5955, which fixes a >> memory leak, but does not release sync lock. >> Interesting if this commit fixes the issue. >> Looking at the sync code, where the lock is made / released, there is only >> one commit recently, which is 8077d714e11388a294f1583e706152396972acce and >> that one creates home directory if missing. >> Would be interesting if you could test with 2.3.2rc1, which was released few >> days ago, whether this issue goes away? >> Aki >>> On 06 May 2018 at 15:21 Reuben Farrelly <reuben-dove...@reub.net> wrote: >>> >>> >>> Hi Andy, >>> >>> Funny you say that - I've been testing this very problem for some time >>> today to narrow down the commits where I think the problem started >>> occurring. >>> >>> So far I have been able to roll forward to commit from git: >>> 74ef8506dcb946acc42722683e49fdcb964eed19 >>> >>> >> "doveadm: Unref header search context after use" . So far I have >>> been running on that code for a few hours and I'm not yet seeing the >>> problem, which indicates the issue is probably subsequent to this. >>> >>> I'm of the view that the problem started occurring somewhere between >>> that commit above on 6 March and the final 2.3.1 release on 27 March. >>> There are a lot of changes between those two dates, and testing to >>> narrow down to one specific commit is extremely time consuming. >>> >>> It would be really helpful if even one of the Dovecot developers could >>> comment on this and give us some direction (or candidate commits to >>> investigate) or just let us know if the root cause has been found. This >>> bug is a showstopper and has stopped me tracking master-2.3 for over 3 >>> months now, as I can't test later builds or even upgrade to the 2.3.1 >>> release while replication is so broken. >>> >>> Reuben >>> >>> >>> >>>> Message: 1 >>>> Date: Sun, 6 May 2018 13:21:57 +1000 >>>> From: Andy Weal <a...@bizemail.com.au> >>>> To: dovecot@dovecot.org >>>> Subject: 2.3.1 Replication is throwing scary errors >>>> Message-ID: <f4274128-350b-9a74-a344-8e6caeffa...@bizemail.com.au> >>>> Content-Type: text/plain; charset="utf-8"; Format="flowed" >>>> >>>> Hi all, >>>> >>>> New to the mailing lists but have joined up because of above */2.3.1 >>>> Replication is throwing scary errors >>>> >>>> >>>> /*Brief system configuration >>>> ??? MX1 - Main >>>> ??? ??? Freebsd 11.1-Release-p9 >>>> ??? ??? Hosted on a Vultr VM in Sydney AU >>>> ??? ??? MTA = Postfix 3.4-20180401 >>>> ??? ??? Dovecot = 2.3.1 >>>> ??? ??? File system = ufs >>>> ??? MX2 - Backup >>>> ??? ??? Freebsd 11.1-Release-p9 >>>> ??? ???? Running on bare metal - no VM or jails >>>> ??? ??? MTA = Postfix 3.4-20180401 >>>> ??? ??? Dovecot = 2.3.1 >>>> ??? ??? File system = ufs ( on SSD) >>>> >>>> */ >>>> /*Brief sequence of events >>>> >>>> 1. ??? apx 10 days back upgraded both mx1 and mx2 to dovecot 2.3.1_2 >>>> from 2.3.0 ? (service dovecot stop, portmaster upgrade, service >>>> dovecot start) >>>> 2. ??? both systems ran ok with no errors for 10 days. >>>> 3. ??? Last night I shutdown mx2 and restarted it a few hours later >>>> 4. ??? within minutes i was getting the following types of errors on mx2 >>>> >>>> ???????? May 06 12:56:29 doveadm: Error: Couldn't lock >>>> /var/mail/vhosts/example.net/user1/.dovecot-sync.lock: >>>> fcntl(/var/mail/vhosts/example.net/user1/.dovecot-sync.lock, write-lock, >>>> F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by >>>> pid 1960) >>>> >>>> ? Before i venture down the rabbit hole of fault finding and excess >>>> coffee consumption I was wondering if any of you had any updates on the >>>> problems discussed below. >>>> >>>> >>>> Cheers for now, >>>> Andy >>>> >>>> >>>> >>>> Hi, >>>> >>>> [Formatting is a bit rough, replying from a trimmed digest email] >>>> >>>>> /Message: 1 />/Date: Fri, 6 Apr 2018 15:04:35 +0200 />/From: Michael >>>>> Grimm <trashcan at ellael.org >>>> <https://dovecot.org/mailman/listinfo/dovecot>> />/To: Dovecot Mailing >>>> List <dovecot at dovecot.org >>>> <https://dovecot.org/mailman/listinfo/dovecot>> />/Subject: Re: 2.3.1 >>>> Replication is throwing scary errors />/Message-ID: >>>> <E7E7A927-68F8-443F-BA59-E66CED8FE878 at ellael.org >>>> <https://dovecot.org/mailman/listinfo/dovecot>> />/Content-Type: >>>> text/plain; charset=utf-8 />//>/Reuben Farrelly wrote: />>/From: Michael >>>> Grimm <trashcan at ellael.org >>>> <https://dovecot.org/mailman/listinfo/dovecot>> />//>>>/[This is Dovecot >>>> 2.3.1 at FreeBSD STABLE-11.1 running in two jails at >>>> distinct servers.] />>>/I did upgrade from 2.2.35 to 2.3.1 today, and I do >>>> become pounded by >>>> error messages at server1 (and vice versa at server2) as follows: />>>/| >>>> Apr 2 17:12:18 <mail.err> server1.lan dovecot: doveadm: Error: >>>> dsync(server2.lan): I/O has stalled, \ />>>/no activity for 600 seconds >>>> (last sent=mail_change, last >>>> recv=mail_change (EOL)) />>>/| Apr 2 17:12:18 <mail.err> server1.lan >>>> dovecot: doveadm: Error: Timeout >>>> during state=sync_mails \ />>>/(send=changes recv=mail_requests) />>>/[?] >>>> />>>/| Apr 2 18:59:03 <mail.err> server1.lan dovecot: doveadm: Error: >>>> dsync(server2.lan): I/O has stalled, \ />>>/no activity for 600 seconds >>>> (last sent=mail, last recv=mail (EOL)) />>>/| Apr 2 18:59:03 <mail.err> >>>> server1.lan dovecot: doveadm: Error: Timeout >>>> during state=sync_mails \ />>>/(send=mails recv=recv_last_common) />>>/I >>>> cannot see in my personal account any missing replications, *but* I >>>> haven't tested this thoroughly enough. I do have customers being >>>> serviced at these productive servers, *thus* I'm back to 2.2.35 until I >>>> do understand or have learned what is going on. />//>/In my reply to this >>>> statement of mine I mentioned that I have seen those >>>> timeouts quite some times during the past year. Thus, I upgraded to >>>> 2.3.1 again, and boom: after some hours I ended up in hanging processes >>>> [1] like (see Remko's mail in addition) ... />//>/doveadm-server: [IP4/6 >>>> <USER1> SOME/MAILBOX import:0/0] (doveadm-server) />//>/? at server2 >>>> paired with a file like ? />//>/-rw------- 1 vmail dovecot uarch 0 Apr 3 >>>> 16:52 >>>> /home/to/USER1/.dovecot-sync.lock />//>/Corresponding logfile entries at >>>> server2 are like ? />//>/Apr 3 17:10:49 <mail.err> server2.lan dovecot: >>>> doveadm: Error: Couldn't >>>> lock /home/to/USER1/.dovecot-sync.lock: \ >>>> />/fcntl(/home/to/USER1/.dovecot-sync.lock, write-lock, F_SETLKW) locking >>>> failed: Timed out after 30 seconds \ />/(WRITE lock held by pid 51110) >>>> />//>/[1] Even stopping dovecot will not end those processes. One has to >>>> manually kill those before restarting dovecot. />//>/After one day of >>>> testing 2.3.1 with a couple of those episodes of >>>> locking/timeout, and now missing mails depending with server your MUA >>>> will connect to, I went back to 2.2.35. After two days at that version I >>>> never had such an episode again. />//>>/It's not just you. This issue hit >>>> me recently, and it was impacting />>/replication noticeably. I am >>>> following git master-2.3 . />/[...] />>/There is also a second issue of a >>>> long standing race with replication />>/occurring somewhere whereby if a >>>> mail comes in, is written to disk, is />>/replicated and then deleted in >>>> short succession, it will reappear />>/again to the MUA. I suspect the >>>> mail is being replicated back from />>/the remote. A few people have >>>> reported it over the years but it's not />>/reliable or consistent, so it >>>> has never been fixed. />>/And lastly there has been an ongoing but >>>> seemingly minor issue />>/relating to locking timing out after 30s >>>> particularly on the remote />>/host that is being replicated to. I rarely >>>> see the problem on my />>/local disk where almost all of the mail comes >>>> in, it's almost always />>/occurring on the replicate/remote system. >>>> />//>/It might be time to describe our setups in order to possibly find co >>>> mmon >>>> grounds that might trigger this issue you describe and Rimko and myself >>>> ran into as well. />//>/Servers: Cloud Instances (both identical), around >>>> 25ms latency apart. />/Intel Core Processor (Haswell, no TSX) (3092.91-MHz >>>> K8-class CPU) />/Both servers are connected via IPsec/racoon tunnels >>>> />/OS: FreeBSD 11.1-STABLE (both servers) />/Filesystem: ZFS />/MTA: >>>> postfix 3.4-20180401 (postfix delivers via dovecot's LMTP) />/IMAP: >>>> dovecot running in FreeBSD jails (issues with 2.3.1, fine with >>>> 2.2.35) />/Replication: unsecured tcp / master-master />/MUA: mainly iOS >>>> or macOS mail.app, rarely roundcube / >>>> For me: >>>> >>>> Servers: Main: VM on a VMWare ESXi local system (light >>>> load), local SSD disks (no NFS) >>>> Redundant: Linode VM in Singapore, around 250ms away >>>> Also no NFS. Linode use SSDs for IO. >>>> There is native IPv6 connectivity between both VMs. >>>> As I am using TCPs I don't have a VPN between them - >>>> just raw IPv6 end to end. >>>> OS: Gentoo Linux x86_64 kept well up to date >>>> Filesystem: EXT4 for both >>>> MTA: Postfix 3.4-x (same as you) >>>> IMAP: Dovecot running natively on the machine (no jail/chroot) >>>> Replication: Secured TCP / master-master (but only one master is >>>> actively used for SMTP and MUA access) >>>> MUA: Thunderbird, Maildroid, MS Mail, Gmail client, >>>> Roundcube, LG Mail client on handset etc >>>> >>>>> /I believe it is worthwhile to mention here that I run a poor man's >>>> fail-over approach (round-robin DNS) as follows: />//>/DNS: >>>> mail.server.tld resolves to one IP4 and one IP6 address of each >>>> server, thus 4 IP addresses in total / >>>> For me I have a simple pair of servers that replicate mail to each >>>> other. Clients connect to the main server only, which is where mail >>>> comes in. The second one that experiences the problem with locking is >>>> only a redundant server which is a secondary MX, that only gets >>>> occasional SMTP deliveries from time to time. So the amount of >>>> replication back from the "secondary" is very minimal; the vast bulk of >>>> the replication is the primary replicating changes out. >>>> >>>> Reuben >>>> >