Re: no full syncs after upgrading to dovecot 2.3.18

Paul Kudla (SCOM.CA Internet Services Inc.) Tue, 26 Apr 2022 06:38:59 -0700

Agreed there seems to be no way of posting these kinds of issues to seeif they are even being addressed or even known about moving forward onnew updates


i read somewhere there is a new branch soming out but nothing as of yet?

2.4 maybe ....
5.0 ........

my previous replication issues (back in feb) went unanswered.

not faulting anyone, but the developers do seem to be disconnected fromissues as of late? or concentrating on other issues.

I have no problem with support contracts for day to day maintencehowever as a programmer myself they usually dont work as the other endrelies on the latest source code anyways. Thus can not help.

I am trying to take a part the replicator c programming based on 2.3.18as most of it does work to some extent.


tcps just does not work (ie 600 seconds default in the c programming)

My thoughts are tcp works ok but fails when the replicator throughdsync-client.c when asked to return the folder list?



replicator-brain.c seems to control the overall process and timing.

replicator-queue.c seems to handle the que file that does seem to carryacurate info.

things in the source code are documented enough to figure this out but iam still going through all the related .h files documentation wise whichare all over the place.

there is no clear documentation on the .h lib files so i have to walkthrough the tree one at a time finding relative code.

since the dsync from doveadm does see to work ok i have to assume thedsync-client used to compile the replicator is at fault somehow or acall from it upstream?

Thanks for your input on the other issues noted below, i will keep thatin mind when disassembling the source code.

No sense in fixing one thing and leaving something else behind, probablyall related anyways.

i have two test servers avaliable so i can play with all this offline toreproduce the issues

Unfortunately I have to make a living first, this will be addressed whenpossible as i dont like systems that are live running this way andcurrently only have 5 accounts with this issue (mine included)





Happy Tuesday !!!
Thanks - paul

Paul Kudla


Scom.ca Internet Services <http://www.scom.ca>
004-1009 Byron Street South
Whitby, Ontario - Canada
L1N 4S3

Toronto 416.642.7266
Main 1.866.411.7266
Fax 1.888.892.7266

On 4/26/2022 9:03 AM, Reuben Farrelly wrote:

I ran into this back in February and documented a reproducible test case(and sent it to this list). In short - I was able to reproduce this byhaving a valid and consistent mailbox on the source/local, creating avery standard empty Maildir/(new|cur|tmp) folder on the remote replica,and then initiating the replicate from the source. This consistentlycaused dsync to fail replication with the error "dovecot.index reset,view is now inconsistent" and sync aborted, leaving the replica mailboxin a screwed up inconsistent state. Client connections on the sourcereplica were also dropped when this error occurred. You can see theerror by enabling debug level logging if you initiate dsync manually ona test mailbox.
The only workaround I found was to remove the remote Maildir and letDovecot create the whole thing from scratch. Dovecot did not like anyexisting folders on the destination replica even if they were the samenames as the source and completely empty. I was able to reproduce thisthe bare minimum of folders - just an INBOX!
I have no idea if any of the developers saw my post or if the bug hasbeen fixed for the next release. But it seemed to be quite a commonproblem over time (saw a few posts from people going back a long waywith the same problem) and it is seriously disruptive to clients. Theerror message is not helpful in tracking down the problem either.
Secondly, I also have had an ongoing and longstanding problem usingtcps: for replication. For some reason using tcps: (with no otherchanges at all to the config) results in a lot of timeout messages"Error: dsync I/O has stalled, no activity for 600 seconds". This goesaway if I revert back to tcp: instead of tcps - with tcp: I very rarelyget timeouts. No idea why, guess this is a bug of some sort also.
It's disappointing that there appears to be no way to have these sortsor problems addressed like there once was. I am not using Dovecot forcommercial purposes so paying a fortune for a support contract for ahigh end installation just isn't going to happen, and this list seems tobe quite ordinary for getting support and reporting bugs nowadays....
Reuben

On 26/04/2022 7:21 pm, Paul Kudla (SCOM.CA Internet Services Inc.) wrote:
side issue
if you are getting inconsistant dsyncs there is no real way to fixthis in the long run.
i know its a pain (already had to my self)
i needed to do a full sync, take one server offline, delete the userdir (with dovecot offline) and then rsync (or somehow duplicate themain server's user data) over the the remote again.
then bring remote back up and it kind or worked worked
best suggestion is to bring the main server down at night so the copyis clean?
if using postfix you can enable the soft bounce option and the mailwill back spool until everything comes back online
(needs to be enable on bother servers)
replication was still an issue on accounts with 300+ folders in them,still working on a fix for that.
Happy Tuesday !!!
Thanks - paul

Paul Kudla


Scom.ca Internet Services <http://www.scom.ca>
004-1009 Byron Street South
Whitby, Ontario - Canada
L1N 4S3

Toronto 416.642.7266
Main 1.866.411.7266
Fax 1.888.892.7266

On 4/25/2022 10:01 AM, Arnaud Abélard wrote:
Ah, I'm now getting errors in the logs, that would explains theincreasing number of failed sync requests:
dovecot: imap(xxxxx)<2961235><Bs6w43rdQPAqAcsFiXEmAInUhhA3Rfqh>:Error: Mailbox INBOX: /vmail/l/i/xxxxx/dovecot.index reset, view isnow inconsistent
And sure enough:

# dovecot replicator status xxxxx

xxxxx         none     00:02:54  07:11:28  -            y


What could explain that error?

Arnaud



On 25/04/2022 15:13, Arnaud Abélard wrote:
Hello,

On my side we are running Linux (Debian Buster).
I'm not sure my problem is actually the same as Paul or youSebastian since I have a lot of boxes but those are actually small(quota of 110MB) so I doubt any of them have more than a dozen imapfolders.
The main symptom is that I have tons of full sync requests awaitingbut even though no other sync is pending the replicator just waitsfor something to trigger those syncs.
Today, with users back I can see that normal and incremental syncsare being done on the 15 connections, with an occasional full synchere or there and lots of "Waiting 'failed' requests":
Queued 'sync' requests        0

Queued 'high' requests        0

Queued 'low' requests         0

Queued 'failed' requests      122

Queued 'full resync' requests 28785

Waiting 'failed' requests     4294

Total number of known users   42512
So, why didn't the replicator take advantage of the weekend toreplicate the mailboxes while no user were using them?
Arnaud




On 25/04/2022 13:54, Sebastian Marske wrote:
Hi there,

thanks for your insights and for diving deeper into this Paul!

For me, the users ending up in 'Waiting for dsync to finish' all have
more than 256 Imap folders as well (ranging from 288 up to >5500;as per
'doveadm mailbox list -u <username> | wc -l'). For more details on my
setup please see my post from February [1].

@Arnaud: What OS are you running on?


Best
Sebastian


[1] https://dovecot.org/pipermail/dovecot/2022-February/124168.html


On 4/24/22 19:36, Paul Kudla (SCOM.CA Internet Services Inc.) wrote:
Question having similiar replication issues

pls read everything below and advise the folder counts on the
non-replicated users?
i find the total number of folders / account seems to be a factorand
NOT the size of the mail box
ie i have customers with 40G of emails no problem over 40 or sofolders
and it works ok

300+ folders seems to be the issue

i have been going through the replication code

no errors being logged

i am assuming that the replication --> dhclient --> other server is
timing out or not reading the folder lists correctly (ie dies after X
folders read)

thus i am going through the code patching for log entries etc to find
the issues.

see

[13:33:57] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot
# ll
total 86
drwxr-xr-x  2 root  wheel  uarch    4B Apr 24 11:11 .
drwxr-xr-x  4 root  wheel  uarch    4B Mar  8  2021 ..
-rw-r--r--  1 root  wheel  uarch   73B Apr 24 11:11 instances
-rw-r--r--  1 root  wheel  uarch  160K Apr 24 13:33 replicator.db

[13:33:58] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot
#

replicator.db seems to get updated ok but never processed properly.

# sync.users
n...@elirpa.com                   high     00:09:41 463:47:01 -     y
ke...@elirpa.com                  high     00:09:23 463:45:43 -     y
p...@scom.ca                      high     00:09:41 463:46:51 -     y
e...@scom.ca                        high     00:09:43 463:47:01 -     y
ed.ha...@dssmgmt.com              high     00:09:42 463:46:58 -     y
p...@paulkudla.net high 00:09:44 463:47:03580:35:07
    y




so ....



two things :
first to get the production stuff to work i had to write a scriptthat
whould find the bad sync's and the force a dsync between the servers

i run this every five minutes or each server.

in crontab

*/10    *                *    *    *    root /usr/bin/nohup
/programs/common/sync.recover > /dev/null


python script to sort things out

# cat /programs/common/sync.recover
#!/usr/local/bin/python3

#Force sync between servers that are reporting bad?

import os,sys,django,socket
from optparse import OptionParser


from lib import *

#Sample Re-Index MB
#doveadm -D force-resync -u p...@scom.ca -f INBOX*



USAGE_TEXT = '''\
usage: %%prog %s[options]
'''

parser = OptionParser(usage=USAGE_TEXT % '', version='0.4')
parser.add_option("-m", "--send_to", dest="send_to", help="SendEmail To")parser.add_option("-e", "--email", dest="email_box", help="Box toIndex")
parser.add_option("-d", "--detail",action='store_true',
dest="detail",default =False, help="Detailed report")
parser.add_option("-i", "--index",action='store_true',
dest="index",default =False, help="Index")

options, args = parser.parse_args()

print (options.email_box)
print (options.send_to)
print (options.detail)

#sys.exit()



print ('Getting Current User Sync Status')
command = commands("/usr/local/bin/doveadm replicator status '*'")


#print command

sync_user_status = command.output.split('\n')

#print sync_user_status

synced = []

for n in range(1,len(sync_user_status)) :
         user = sync_user_status[n]
         print ('Processing User : %s' %user.split(' ')[0])
         if user.split(' ')[0] != options.email_box :
                 if options.email_box != None :
                         continue

         if options.index == True :
command = '/usr/local/bin/doveadm -D force-resync-u %s
-f INBOX*' %user.split(' ')[0]
                 command = commands(command)
                 command = command.output

         #print user
         for nn in range (len(user)-1,0,-1) :
                 #print nn
                 #print user[nn]

                 if user[nn] == '-' :
                         #print 'skipping ... %s' %user.split(' ')[0]

                         break



                 if user[nn] == 'y': #Found a Bad Mailbox
                         print ('syncing ... %s' %user.split(' ')[0])


                         if options.detail == True :
                                 command = '/usr/local/bin/doveadm -D
sync -u %s -d -N -l 30 -U' %user.split(' ')[0]
                                 print (command)
                                 command = commands(command)
                                 command = command.output.split('\n')
                                 print (command)
print ('Processed Mailbox for ...%s'
%user.split(' ')[0] )
synced.append('Processed Mailboxfor ...
%s' %user.split(' ')[0])
                                 for nnn in range(len(command)):
synced.append(command[nnn] + '\n')
                                 break


                         if options.detail == False :
#command ='/usr/local/bin/doveadm -D
sync -u %s -d -N -l 30 -U' %user.split(' ')[0]
                                 #print (command)
                                 #command = os.system(command)
                                 command = subprocess.Popen(
["/usr/local/bin/doveadm sync -u %s -d -N -l 30 -U" %user.split('')[0]
], \
shell = True, stdin=None,stdout=None,
stderr=None, close_fds=True)
print ( 'Processed Mailbox for... %s'
%user.split(' ')[0] )
synced.append('Processed Mailboxfor ...
%s' %user.split(' ')[0])
                                 #sys.exit()
                                 break

if len(synced) != 0 :
         #send email showing bad synced boxes ?

         if options.send_to != None :
                 send_from = 'moni...@scom.ca'
                 send_to = ['%s' %options.send_to]
                 send_subject = 'Dovecot Bad Sync Report for : %s'
%(socket.gethostname())
                 send_text = '\n\n'
                 for n in range (len(synced)) :
                         send_text = send_text + synced[n] + '\n'

                 send_files = []
sendmail (send_from, send_to, send_subject,send_text,
send_files)



sys.exit()

second :

i posted this a month ago - no response

please appreciate that i am trying to help ....

after much testing i can now reporduce the replication issues at hand
I am running on freebsd 12 & 13 stable (both test and productionservers)
sdram drives etc ...
Basically replication works fine until reaching a folder quantityof ~
256 or more

to reproduce using doveadm i created folders like

INBOX/folder-0
INBOX/folder-1
INBOX/folder-2
INBOX/folder-3
and so forth ......

I created 200 folders and they replicated ok on both servers
I created another 200 (400 total) and the replicator got stuck andwould
not update the mbox on the alternate server anymore and is still
updating 4 days later ?
basically replicator goes so far and either hangs or more likelybails
on an error that is not reported to the debug reporting ?
however dsync will sync the two servers but only when run manually(ie
all the folders will sync)
I have two test servers avaliable if you need any kind of access -again
here to help.

[07:28:42] mail18.scom.ca [root:0] ~
# sync.status
Queued 'sync' requests        0
Queued 'high' requests        6
Queued 'low' requests         0
Queued 'failed' requests      0
Queued 'full resync' requests 0
Waiting 'failed' requests     0
Total number of known users   255

username                       type        status
p...@scom.ca normal Waiting for dsync tofinishke...@elirpa.com incremental Waiting for dsync tofinished.ha...@dssmgmt.com incremental Waiting for dsync tofinishe...@scom.ca incremental Waiting for dsync tofinishn...@elirpa.com incremental Waiting for dsync tofinishp...@paulkudla.net incremental Waiting for dsync tofinish
i have been going through the c code and it seems the replicationgets
requested ok

replicator.db does get updated ok with the replicated request for the
mbox in question.

however i am still looking for the actual replicator function in the
lib's that do the actual replication requests

the number of folders & subfolders is defanately the issue - not the
mbox pyhsical size as thought origionally.

if someone can point me in the right direction, it seems either the
replicator is not picking up on the number of folders to replicat
properly or it has a hard set limit like 256 / 512 / 65535 etc andstops
the replication request thereafter.

I am mainly a machine code programmer from the 80's and have
concentrated on python as of late, 'c' i am starting to go throughjust
to give you a background on my talents.

It took 2 months to finger this out.

this issue also seems to be indirectly causing the duplicate messages
supression not to work as well.
python programming to reproduce issue (loops are for last runstarted @
200 - fyi) :

# cat mbox.gen
#!/usr/local/bin/python2

import os,sys

from lib import *


user = 'p...@paulkudla.net'

"""
for count in range (0,600) :
         box = 'INBOX/folder-%s' %count
         print count
command = '/usr/local/bin/doveadm mailbox create -s -u %s%s'
%(user,box)
         print command
         a = commands.getoutput(command)
         print a
"""

for count in range (0,600) :
         box = 'INBOX/folder-0/sub-%' %count
         print count
command = '/usr/local/bin/doveadm mailbox create -s -u %s%s'
%(user,box)
         print command
         a = commands.getoutput(command)
         print a



         #sys.exit()





Happy Sunday !!!
Thanks - paul

Paul Kudla


Scom.ca Internet Services <http://www.scom.ca>
004-1009 Byron Street South
Whitby, Ontario - Canada
L1N 4S3

Toronto 416.642.7266
Main 1.866.411.7266
Fax 1.888.892.7266

On 4/24/2022 10:22 AM, Arnaud Abélard wrote:
Hello,

I am working on replicating a server (and adding compression on the
other side) and since I had "Error: dsync I/O has stalled, noactivityfor 600 seconds (version not received)" errors I upgraded bothsource
and destination server with the latest 2.3 version (2.3.18). While
before the upgrade all the 15 replication connections were busyafterupgrading dovecot replicator dsync-status shows that most of thetimenothing is being replicated at all. I can see some briefreplications
that last, but 99,9% of the time nothing is happening at all.

I have a replication_full_sync_interval of 12 hours but I have
thousands of users with their last full sync over 90 hours ago.
"doveadm replicator status" also shows that i have over 35,000queuedfull resync requests, but no sync, high or low queued requests sowhy
aren't the full requests occuring?

There are no errors in the logs.

Thanks,

Arnaud

Re: no full syncs after upgrading to dovecot 2.3.18

Reply via email to