Hi! This is probably going to get fixed in 2.3.19; it looks like an issue we are already fixing.

Aki

> On 26/04/2022 16:38 Paul Kudla (SCOM.CA Internet Services Inc.) <p...@scom.ca> wrote:
>
> Agreed, there seems to be no way of posting these kinds of issues to
> see if they are even being addressed, or even known about, moving
> forward in new updates.
>
> I read somewhere there is a new branch coming out, but nothing as of
> yet?
>
> 2.4 maybe ....
> 5.0 ........
>
> My previous replication issues (back in February) went unanswered.
>
> Not faulting anyone, but the developers do seem to be disconnected
> from issues as of late? Or concentrating on other issues.
>
> I have no problem with support contracts for day-to-day maintenance;
> however, as a programmer myself, they usually don't work, as the other
> end relies on the latest source code anyway and thus cannot help.
>
> I am trying to take apart the replicator C code based on 2.3.18, as
> most of it does work to some extent.
>
> tcps just does not work (i.e. the 600-second default in the C code).
>
> My thought is that tcp works OK but fails when the replicator, through
> dsync-client.c, is asked to return the folder list?
>
> replicator-brain.c seems to control the overall process and timing.
>
> replicator-queue.c seems to handle the queue file, which does seem to
> carry accurate info.
>
> Things in the source code are documented well enough to figure this
> out, but I am still going through all the related .h files, whose
> documentation is all over the place.
>
> There is no clear documentation on the .h lib files, so I have to walk
> through the tree one file at a time finding the relevant code.
>
> Since the dsync from doveadm does seem to work OK, I have to assume
> the dsync-client used to compile the replicator is at fault somehow,
> or a call from it upstream?
>
> Thanks for your input on the other issues noted below; I will keep
> that in mind when disassembling the source code.
>
> No sense in fixing one thing and leaving something else behind - it is
> probably all related anyway.
>
> I have two test servers available, so I can play with all this offline
> to reproduce the issues.
>
> Unfortunately I have to make a living first; this will be addressed
> when possible, as I don't like live systems running this way and
> currently only have 5 accounts with this issue (mine included).
>
> Happy Tuesday !!!
> Thanks - paul
>
> Paul Kudla
>
> Scom.ca Internet Services <http://www.scom.ca>
> 004-1009 Byron Street South
> Whitby, Ontario - Canada
> L1N 4S3
>
> Toronto 416.642.7266
> Main 1.866.411.7266
> Fax 1.888.892.7266
>
> On 4/26/2022 9:03 AM, Reuben Farrelly wrote:
> >
> > I ran into this back in February and documented a reproducible test
> > case (and sent it to this list). In short - I was able to reproduce
> > this by having a valid and consistent mailbox on the source/local,
> > creating a very standard empty Maildir/(new|cur|tmp) folder on the
> > remote replica, and then initiating the replication from the source.
> > This consistently caused dsync to fail replication with the error
> > "dovecot.index reset, view is now inconsistent" and abort the sync,
> > leaving the replica mailbox in a screwed-up, inconsistent state.
> > Client connections on the source replica were also dropped when this
> > error occurred. You can see the error by enabling debug-level logging
> > if you initiate dsync manually on a test mailbox.
> >
> > The only workaround I found was to remove the remote Maildir and let
> > Dovecot create the whole thing from scratch.
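> >
> > (For reference, a manual run along these lines should surface the
> > error in the debug output - the username is only an example here;
> > -d makes dsync use the configured mail_replica as the destination,
> > and -D together with mail_debug=yes gives the debug logging that
> > contains the "dovecot.index reset" line:)
> >
> >     doveadm -D -o mail_debug=yes sync -u testuser@example.com -d -N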
> > Dovecot did not like any existing folders on the destination
> > replica, even if they had the same names as on the source and were
> > completely empty. I was able to reproduce this with the bare minimum
> > of folders - just an INBOX!
> >
> > I have no idea if any of the developers saw my post or if the bug
> > has been fixed for the next release. But it seems to be quite a
> > common problem over time (I saw a few posts from people going back a
> > long way with the same problem) and it is seriously disruptive to
> > clients. The error message is not helpful in tracking down the
> > problem either.
> >
> > Secondly, I have also had an ongoing and longstanding problem using
> > tcps: for replication. For some reason using tcps: (with no other
> > changes at all to the config) results in a lot of timeout messages:
> > "Error: dsync I/O has stalled, no activity for 600 seconds". This
> > goes away if I revert back to tcp: instead of tcps: - with tcp: I
> > very rarely get timeouts. No idea why; I guess this is a bug of some
> > sort also.
> >
> > It's disappointing that there appears to be no way to have these
> > sorts of problems addressed like there once was. I am not using
> > Dovecot for commercial purposes, so paying a fortune for a support
> > contract for a high-end installation just isn't going to happen, and
> > this list seems to be quite ordinary for getting support and
> > reporting bugs nowadays....
> >
> > Reuben
> >
> > On 26/04/2022 7:21 pm, Paul Kudla (SCOM.CA Internet Services Inc.) wrote:
> >>
> >> Side issue:
> >>
> >> If you are getting inconsistent dsyncs there is no real way to fix
> >> this in the long run.
> >>
> >> I know it's a pain (I already had to do it myself).
> >>
> >> I needed to do a full sync: take one server offline, delete the
> >> user dir (with Dovecot offline) and then rsync (or somehow
> >> duplicate the main server's user data) over to the remote again.
> >>
> >> Then bring the remote back up, and it kind of worked.
> >>
> >> Best suggestion is to bring the main server down at night so the
> >> copy is clean?
> >>
> >> If using Postfix you can enable the soft_bounce option and the mail
> >> will spool up until everything comes back online.
> >>
> >> (It needs to be enabled on both servers.)
> >>
> >> Replication was still an issue on accounts with 300+ folders in
> >> them; still working on a fix for that.
> >>
> >> Happy Tuesday !!!
> >> Thanks - paul
> >>
> >> Paul Kudla
> >>
> >> Scom.ca Internet Services <http://www.scom.ca>
> >> 004-1009 Byron Street South
> >> Whitby, Ontario - Canada
> >> L1N 4S3
> >>
> >> Toronto 416.642.7266
> >> Main 1.866.411.7266
> >> Fax 1.888.892.7266
> >>
> >> On 4/25/2022 10:01 AM, Arnaud Abélard wrote:
> >>> Ah, I'm now getting errors in the logs; that would explain the
> >>> increasing number of failed sync requests:
> >>>
> >>> dovecot: imap(xxxxx)<2961235><Bs6w43rdQPAqAcsFiXEmAInUhhA3Rfqh>:
> >>> Error: Mailbox INBOX: /vmail/l/i/xxxxx/dovecot.index reset, view is
> >>> now inconsistent
> >>>
> >>> And sure enough:
> >>>
> >>> # doveadm replicator status xxxxx
> >>>
> >>> xxxxx none 00:02:54 07:11:28 - y
> >>>
> >>> What could explain that error?
> >>>
> >>> Arnaud
> >>>
> >>> On 25/04/2022 15:13, Arnaud Abélard wrote:
> >>>> Hello,
> >>>>
> >>>> On my side we are running Linux (Debian Buster).
> >>>>
> >>>> I'm not sure my problem is actually the same as Paul's or yours,
> >>>> Sebastian, since I have a lot of boxes but they are actually
> >>>> small (quota of 110MB), so I doubt any of them have more than a
> >>>> dozen IMAP folders.
> >>>>
> >>>> The main symptom is that I have tons of full sync requests
> >>>> awaiting, but even though no other sync is pending, the
> >>>> replicator just waits for something to trigger those syncs.
> >>>>
> >>>> Today, with users back, I can see that normal and incremental
> >>>> syncs are being done on the 15 connections, with an occasional
> >>>> full sync here or there and lots of "Waiting 'failed' requests":
> >>>>
> >>>> Queued 'sync' requests          0
> >>>> Queued 'high' requests          0
> >>>> Queued 'low' requests           0
> >>>> Queued 'failed' requests        122
> >>>> Queued 'full resync' requests   28785
> >>>> Waiting 'failed' requests       4294
> >>>> Total number of known users     42512
> >>>>
> >>>> So, why didn't the replicator take advantage of the weekend to
> >>>> replicate the mailboxes while no users were using them?
> >>>>
> >>>> Arnaud
> >>>>
> >>>> On 25/04/2022 13:54, Sebastian Marske wrote:
> >>>>> Hi there,
> >>>>>
> >>>>> thanks for your insights and for diving deeper into this, Paul!
> >>>>>
> >>>>> For me, the users ending up in 'Waiting for dsync to finish' all
> >>>>> have more than 256 IMAP folders as well (ranging from 288 up to
> >>>>> >5500, as per 'doveadm mailbox list -u <username> | wc -l'). For
> >>>>> more details on my setup please see my post from February [1].
> >>>>>
> >>>>> @Arnaud: What OS are you running on?
> >>>>>
> >>>>> Best
> >>>>> Sebastian
> >>>>>
> >>>>> [1] https://dovecot.org/pipermail/dovecot/2022-February/124168.html
> >>>>>
> >>>>> On 4/24/22 19:36, Paul Kudla (SCOM.CA Internet Services Inc.) wrote:
> >>>>>>
> >>>>>> Question: I'm having similar replication issues.
> >>>>>>
> >>>>>> Please read everything below and advise the folder counts on
> >>>>>> the non-replicated users?
> >>>>>>
> >>>>>> I find the total number of folders per account seems to be a
> >>>>>> factor, and NOT the size of the mailbox.
> >>>>>>
> >>>>>> i.e. I have customers with 40G of email spread over 40 or so
> >>>>>> folders with no problem, and it works OK.
> >>>>>>
> >>>>>> 300+ folders seems to be the issue.
> >>>>>>
> >>>>>> I have been going through the replication code.
> >>>>>>
> >>>>>> No errors are being logged.
> >>>>>>
> >>>>>> I am assuming that the replicator --> dsync-client --> other
> >>>>>> server path is timing out or not reading the folder lists
> >>>>>> correctly (i.e. dies after X folders read).
> >>>>>>
> >>>>>> Thus I am going through the code, patching in log entries etc.
> >>>>>> to find the issues.
> >>>>>>
> >>>>>> See:
> >>>>>>
> >>>>>> [13:33:57] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot
> >>>>>> # ll
> >>>>>> total 86
> >>>>>> drwxr-xr-x  2 root  wheel  uarch    4B Apr 24 11:11 .
> >>>>>> drwxr-xr-x  4 root  wheel  uarch    4B Mar  8  2021 ..
> >>>>>> -rw-r--r--  1 root  wheel  uarch   73B Apr 24 11:11 instances
> >>>>>> -rw-r--r--  1 root  wheel  uarch  160K Apr 24 13:33 replicator.db
> >>>>>>
> >>>>>> [13:33:58] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot
> >>>>>> #
> >>>>>>
> >>>>>> replicator.db seems to get updated OK but never processed
> >>>>>> properly.
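> >>>>>>
> >>>>>> (Side note: to check the folder-count theory on other installs,
> >>>>>> something roughly like this should print a per-user folder
> >>>>>> count - assuming 'doveadm user' can iterate your userdb;
> >>>>>> untested as written:)
> >>>>>>
> >>>>>> for u in $(doveadm user '*'); do
> >>>>>>     printf '%s %s\n' "$u" "$(doveadm mailbox list -u "$u" | wc -l)"
> >>>>>> done | sort -k2 -n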
> >>>>>>
> >>>>>> # sync.users
> >>>>>> n...@elirpa.com        high  00:09:41  463:47:01  -          y
> >>>>>> ke...@elirpa.com       high  00:09:23  463:45:43  -          y
> >>>>>> p...@scom.ca           high  00:09:41  463:46:51  -          y
> >>>>>> e...@scom.ca           high  00:09:43  463:47:01  -          y
> >>>>>> ed.ha...@dssmgmt.com   high  00:09:42  463:46:58  -          y
> >>>>>> p...@paulkudla.net     high  00:09:44  463:47:03  580:35:07  y
> >>>>>>
> >>>>>> So ....
> >>>>>>
> >>>>>> Two things:
> >>>>>>
> >>>>>> First, to get the production stuff to work I had to write a
> >>>>>> script that finds the bad syncs and then forces a dsync between
> >>>>>> the servers.
> >>>>>>
> >>>>>> I run it on each server from crontab:
> >>>>>>
> >>>>>> */10 * * * * root /usr/bin/nohup /programs/common/sync.recover > /dev/null
> >>>>>>
> >>>>>> Python script to sort things out:
> >>>>>>
> >>>>>> # cat /programs/common/sync.recover
> >>>>>> #!/usr/local/bin/python3
> >>>>>>
> >>>>>> # Force sync between servers that are reporting bad?
> >>>>>>
> >>>>>> import os, sys, django, socket
> >>>>>> import subprocess
> >>>>>> from optparse import OptionParser
> >>>>>>
> >>>>>> from lib import *
> >>>>>>
> >>>>>> # Sample re-index of a mailbox:
> >>>>>> # doveadm -D force-resync -u p...@scom.ca -f INBOX*
> >>>>>>
> >>>>>> USAGE_TEXT = '''\
> >>>>>> usage: %%prog %s[options]
> >>>>>> '''
> >>>>>>
> >>>>>> parser = OptionParser(usage=USAGE_TEXT % '', version='0.4')
> >>>>>>
> >>>>>> parser.add_option("-m", "--send_to", dest="send_to", help="Send Email To")
> >>>>>> parser.add_option("-e", "--email", dest="email_box", help="Box to Index")
> >>>>>> parser.add_option("-d", "--detail", action='store_true',
> >>>>>>                   dest="detail", default=False, help="Detailed report")
> >>>>>> parser.add_option("-i", "--index", action='store_true',
> >>>>>>                   dest="index", default=False, help="Index")
> >>>>>>
> >>>>>> options, args = parser.parse_args()
> >>>>>>
> >>>>>> print(options.email_box)
> >>>>>> print(options.send_to)
> >>>>>> print(options.detail)
> >>>>>>
> >>>>>> #sys.exit()
> >>>>>>
> >>>>>> print('Getting Current User Sync Status')
> >>>>>> command = commands("/usr/local/bin/doveadm replicator status '*'")
> >>>>>>
> >>>>>> #print(command)
> >>>>>>
> >>>>>> sync_user_status = command.output.split('\n')
> >>>>>>
> >>>>>> #print(sync_user_status)
> >>>>>>
> >>>>>> synced = []
> >>>>>>
> >>>>>> for n in range(1, len(sync_user_status)):
> >>>>>>     user = sync_user_status[n]
> >>>>>>     print('Processing User : %s' % user.split(' ')[0])
> >>>>>>     if user.split(' ')[0] != options.email_box:
> >>>>>>         if options.email_box != None:
> >>>>>>             continue
> >>>>>>
> >>>>>>     if options.index == True:
> >>>>>>         command = '/usr/local/bin/doveadm -D force-resync -u %s -f INBOX*' % user.split(' ')[0]
> >>>>>>         command = commands(command)
> >>>>>>         command = command.output
> >>>>>>
> >>>>>>     #print(user)
> >>>>>>     for nn in range(len(user) - 1, 0, -1):
> >>>>>>         #print(nn)
> >>>>>>         #print(user[nn])
> >>>>>>
> >>>>>>         if user[nn] == '-':
> >>>>>>             #print('skipping ... %s' % user.split(' ')[0])
> >>>>>>             break
> >>>>>>
> >>>>>>         if user[nn] == 'y':  # Found a bad mailbox
> >>>>>>             print('syncing ... %s' % user.split(' ')[0])
> >>>>>>
> >>>>>>             if options.detail == True:
> >>>>>>                 command = '/usr/local/bin/doveadm -D sync -u %s -d -N -l 30 -U' % user.split(' ')[0]
> >>>>>>                 print(command)
> >>>>>>                 command = commands(command)
> >>>>>>                 command = command.output.split('\n')
> >>>>>>                 print(command)
> >>>>>>                 print('Processed Mailbox for ... %s' % user.split(' ')[0])
> >>>>>>                 synced.append('Processed Mailbox for ... %s' % user.split(' ')[0])
> >>>>>>                 for nnn in range(len(command)):
> >>>>>>                     synced.append(command[nnn] + '\n')
> >>>>>>                 break
> >>>>>>
> >>>>>>             if options.detail == False:
> >>>>>>                 #command = '/usr/local/bin/doveadm -D sync -u %s -d -N -l 30 -U' % user.split(' ')[0]
> >>>>>>                 #print(command)
> >>>>>>                 #command = os.system(command)
> >>>>>>                 command = subprocess.Popen(
> >>>>>>                     ["/usr/local/bin/doveadm sync -u %s -d -N -l 30 -U" % user.split(' ')[0]],
> >>>>>>                     shell=True, stdin=None, stdout=None,
> >>>>>>                     stderr=None, close_fds=True)
> >>>>>>
> >>>>>>                 print('Processed Mailbox for ... %s' % user.split(' ')[0])
> >>>>>>                 synced.append('Processed Mailbox for ... %s' % user.split(' ')[0])
> >>>>>>                 #sys.exit()
> >>>>>>                 break
> >>>>>>
> >>>>>> if len(synced) != 0:
> >>>>>>     # Send an email showing the bad synced boxes?
> >>>>>>
> >>>>>>     if options.send_to != None:
> >>>>>>         send_from = 'moni...@scom.ca'
> >>>>>>         send_to = ['%s' % options.send_to]
> >>>>>>         send_subject = 'Dovecot Bad Sync Report for : %s' % (socket.gethostname())
> >>>>>>         send_text = '\n\n'
> >>>>>>         for n in range(len(synced)):
> >>>>>>             send_text = send_text + synced[n] + '\n'
> >>>>>>
> >>>>>>         send_files = []
> >>>>>>         sendmail(send_from, send_to, send_subject, send_text, send_files)
> >>>>>>
> >>>>>> sys.exit()
> >>>>>>
> >>>>>> Second:
> >>>>>>
> >>>>>> I posted this a month ago - no response.
> >>>>>>
> >>>>>> Please appreciate that I am trying to help ....
> >>>>>>
> >>>>>> After much testing I can now reproduce the replication issues
> >>>>>> at hand.
> >>>>>>
> >>>>>> I am running on FreeBSD 12 & 13 stable (both test and
> >>>>>> production servers).
> >>>>>>
> >>>>>> SDRAM drives etc ...
> >>>>>>
> >>>>>> Basically replication works fine until reaching a folder
> >>>>>> quantity of ~256 or more.
> >>>>>>
> >>>>>> To reproduce using doveadm I created folders like
> >>>>>>
> >>>>>> INBOX/folder-0
> >>>>>> INBOX/folder-1
> >>>>>> INBOX/folder-2
> >>>>>> INBOX/folder-3
> >>>>>> and so forth ......
> >>>>>>
> >>>>>> I created 200 folders and they replicated OK on both servers.
> >>>>>>
> >>>>>> I created another 200 (400 total) and the replicator got stuck
> >>>>>> and would not update the mailbox on the alternate server any
> >>>>>> more, and it is still "updating" 4 days later?
> >>>>>>
> >>>>>> Basically the replicator goes so far and either hangs or, more
> >>>>>> likely, bails on an error that is not reported to the debug
> >>>>>> logging?
> >>>>>>
> >>>>>> However, dsync will sync the two servers, but only when run
> >>>>>> manually (i.e. all the folders will sync).
> >>>>>>
> >>>>>> I have two test servers available if you need any kind of
> >>>>>> access - again, here to help.
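> >>>>>>
> >>>>>> (If it helps, a quick way to see whether the replica actually
> >>>>>> received everything is to diff the folder lists from both
> >>>>>> sides - a rough sketch, where the second hostname is only a
> >>>>>> placeholder for the other replica:)
> >>>>>>
> >>>>>> doveadm mailbox list -u p...@paulkudla.net | sort > /tmp/local.folders
> >>>>>> ssh replica.example.com "doveadm mailbox list -u p...@paulkudla.net" | sort > /tmp/remote.folders
> >>>>>> diff /tmp/local.folders /tmp/remote.folders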
> >>>>>> > >>>>>> [07:28:42] mail18.scom.ca [root:0] ~ > >>>>>> # sync.status > >>>>>> Queued 'sync' requests 0 > >>>>>> Queued 'high' requests 6 > >>>>>> Queued 'low' requests 0 > >>>>>> Queued 'failed' requests 0 > >>>>>> Queued 'full resync' requests 0 > >>>>>> Waiting 'failed' requests 0 > >>>>>> Total number of known users 255 > >>>>>> > >>>>>> username type status > >>>>>> p...@scom.ca normal Waiting for dsync to > >>>>>> finish > >>>>>> ke...@elirpa.com incremental Waiting for dsync to > >>>>>> finish > >>>>>> ed.ha...@dssmgmt.com incremental Waiting for dsync to > >>>>>> finish > >>>>>> e...@scom.ca incremental Waiting for dsync to > >>>>>> finish > >>>>>> n...@elirpa.com incremental Waiting for dsync to > >>>>>> finish > >>>>>> p...@paulkudla.net incremental Waiting for dsync to > >>>>>> finish > >>>>>> > >>>>>> > >>>>>> i have been going through the c code and it seems the replication > >>>>>> gets > >>>>>> requested ok > >>>>>> > >>>>>> replicator.db does get updated ok with the replicated request for the > >>>>>> mbox in question. > >>>>>> > >>>>>> however i am still looking for the actual replicator function in the > >>>>>> lib's that do the actual replication requests > >>>>>> > >>>>>> the number of folders & subfolders is defanately the issue - not the > >>>>>> mbox pyhsical size as thought origionally. > >>>>>> > >>>>>> if someone can point me in the right direction, it seems either the > >>>>>> replicator is not picking up on the number of folders to replicat > >>>>>> properly or it has a hard set limit like 256 / 512 / 65535 etc and > >>>>>> stops > >>>>>> the replication request thereafter. > >>>>>> > >>>>>> I am mainly a machine code programmer from the 80's and have > >>>>>> concentrated on python as of late, 'c' i am starting to go through > >>>>>> just > >>>>>> to give you a background on my talents. > >>>>>> > >>>>>> It took 2 months to finger this out. > >>>>>> > >>>>>> this issue also seems to be indirectly causing the duplicate messages > >>>>>> supression not to work as well. > >>>>>> > >>>>>> python programming to reproduce issue (loops are for last run > >>>>>> started @ > >>>>>> 200 - fyi) : > >>>>>> > >>>>>> # cat mbox.gen > >>>>>> #!/usr/local/bin/python2 > >>>>>> > >>>>>> import os,sys > >>>>>> > >>>>>> from lib import * > >>>>>> > >>>>>> > >>>>>> user = 'p...@paulkudla.net' > >>>>>> > >>>>>> """ > >>>>>> for count in range (0,600) : > >>>>>> box = 'INBOX/folder-%s' %count > >>>>>> print count > >>>>>> command = '/usr/local/bin/doveadm mailbox create -s -u %s > >>>>>> %s' > >>>>>> %(user,box) > >>>>>> print command > >>>>>> a = commands.getoutput(command) > >>>>>> print a > >>>>>> """ > >>>>>> > >>>>>> for count in range (0,600) : > >>>>>> box = 'INBOX/folder-0/sub-%' %count > >>>>>> print count > >>>>>> command = '/usr/local/bin/doveadm mailbox create -s -u %s > >>>>>> %s' > >>>>>> %(user,box) > >>>>>> print command > >>>>>> a = commands.getoutput(command) > >>>>>> print a > >>>>>> > >>>>>> > >>>>>> > >>>>>> #sys.exit() > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> Happy Sunday !!! 
> >>>>>> Thanks - paul
> >>>>>>
> >>>>>> Paul Kudla
> >>>>>>
> >>>>>> Scom.ca Internet Services <http://www.scom.ca>
> >>>>>> 004-1009 Byron Street South
> >>>>>> Whitby, Ontario - Canada
> >>>>>> L1N 4S3
> >>>>>>
> >>>>>> Toronto 416.642.7266
> >>>>>> Main 1.866.411.7266
> >>>>>> Fax 1.888.892.7266
> >>>>>>
> >>>>>> On 4/24/2022 10:22 AM, Arnaud Abélard wrote:
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> I am working on replicating a server (and adding compression
> >>>>>>> on the other side), and since I had "Error: dsync I/O has
> >>>>>>> stalled, no activity for 600 seconds (version not received)"
> >>>>>>> errors I upgraded both the source and destination servers to
> >>>>>>> the latest 2.3 version (2.3.18). While before the upgrade all
> >>>>>>> 15 replication connections were busy, after upgrading,
> >>>>>>> "doveadm replicator dsync-status" shows that most of the time
> >>>>>>> nothing is being replicated at all. I can see some brief
> >>>>>>> replications here and there, but 99.9% of the time nothing is
> >>>>>>> happening at all.
> >>>>>>>
> >>>>>>> I have a replication_full_sync_interval of 12 hours, but I
> >>>>>>> have thousands of users whose last full sync was over 90 hours
> >>>>>>> ago.
> >>>>>>>
> >>>>>>> "doveadm replicator status" also shows that I have over 35,000
> >>>>>>> queued full resync requests, but no sync, high or low queued
> >>>>>>> requests, so why aren't the full requests occurring?
> >>>>>>>
> >>>>>>> There are no errors in the logs.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>> Arnaud
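> >>>>>>>
> >>>>>>> P.S. For reference, the relevant bits of my replication config
> >>>>>>> look roughly like this (the hostname is a placeholder and the
> >>>>>>> 12h / 15 just match the numbers above; treat it as a sketch,
> >>>>>>> not a verbatim copy of my dovecot.conf):
> >>>>>>>
> >>>>>>> replication_full_sync_interval = 12h
> >>>>>>> replication_max_conns = 15
> >>>>>>>
> >>>>>>> mail_plugins = $mail_plugins notify replication
> >>>>>>>
> >>>>>>> plugin {
> >>>>>>>   mail_replica = tcp:replica.example.com
> >>>>>>> }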