Question having similiar replication issues
pls read everything below and advise the folder counts on the
non-replicated users?
i find the total number of folders / account seems to be a factor
and
NOT the size of the mail box
ie i have customers with 40G of emails no problem over 40 or so
folders
and it works ok
300+ folders seems to be the issue
i have been going through the replication code
no errors being logged
i am assuming that the replication --> dhclient --> other server is
timing out or not reading the folder lists correctly (ie dies after X
folders read)
thus i am going through the code patching for log entries etc to find
the issues.
see
[13:33:57] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot
# ll
total 86
drwxr-xr-x 2 root wheel uarch 4B Apr 24 11:11 .
drwxr-xr-x 4 root wheel uarch 4B Mar 8 2021 ..
-rw-r--r-- 1 root wheel uarch 73B Apr 24 11:11 instances
-rw-r--r-- 1 root wheel uarch 160K Apr 24 13:33 replicator.db
[13:33:58] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot
#
replicator.db seems to get updated ok but never processed properly.
# sync.users
n...@elirpa.com high 00:09:41 463:47:01 - y
ke...@elirpa.com high 00:09:23 463:45:43 - y
p...@scom.ca high 00:09:41 463:46:51 - y
e...@scom.ca high 00:09:43 463:47:01 - y
ed.ha...@dssmgmt.com high 00:09:42 463:46:58 - y
p...@paulkudla.net high 00:09:44 463:47:03
580:35:07
y
so ....
two things :
first to get the production stuff to work i had to write a script
that
whould find the bad sync's and the force a dsync between the servers
i run this every five minutes or each server.
in crontab
*/10 * * * * root /usr/bin/nohup
/programs/common/sync.recover > /dev/null
python script to sort things out
# cat /programs/common/sync.recover
#!/usr/local/bin/python3
#Force sync between servers that are reporting bad?
import os,sys,django,socket
from optparse import OptionParser
from lib import *
#Sample Re-Index MB
#doveadm -D force-resync -u p...@scom.ca -f INBOX*
USAGE_TEXT = '''\
usage: %%prog %s[options]
'''
parser = OptionParser(usage=USAGE_TEXT % '', version='0.4')
parser.add_option("-m", "--send_to", dest="send_to", help="Send
Email To")
parser.add_option("-e", "--email", dest="email_box", help="Box to
Index")
parser.add_option("-d", "--detail",action='store_true',
dest="detail",default =False, help="Detailed report")
parser.add_option("-i", "--index",action='store_true',
dest="index",default =False, help="Index")
options, args = parser.parse_args()
print (options.email_box)
print (options.send_to)
print (options.detail)
#sys.exit()
print ('Getting Current User Sync Status')
command = commands("/usr/local/bin/doveadm replicator status '*'")
#print command
sync_user_status = command.output.split('\n')
#print sync_user_status
synced = []
for n in range(1,len(sync_user_status)) :
user = sync_user_status[n]
print ('Processing User : %s' %user.split(' ')[0])
if user.split(' ')[0] != options.email_box :
if options.email_box != None :
continue
if options.index == True :
command = '/usr/local/bin/doveadm -D force-resync
-u %s
-f INBOX*' %user.split(' ')[0]
command = commands(command)
command = command.output
#print user
for nn in range (len(user)-1,0,-1) :
#print nn
#print user[nn]
if user[nn] == '-' :
#print 'skipping ... %s' %user.split(' ')[0]
break
if user[nn] == 'y': #Found a Bad Mailbox
print ('syncing ... %s' %user.split(' ')[0])
if options.detail == True :
command = '/usr/local/bin/doveadm -D
sync -u %s -d -N -l 30 -U' %user.split(' ')[0]
print (command)
command = commands(command)
command = command.output.split('\n')
print (command)
print ('Processed Mailbox for ...
%s'
%user.split(' ')[0] )
synced.append('Processed Mailbox
for ...
%s' %user.split(' ')[0])
for nnn in range(len(command)):
synced.append(command[nnn] + '\n')
break
if options.detail == False :
#command =
'/usr/local/bin/doveadm -D
sync -u %s -d -N -l 30 -U' %user.split(' ')[0]
#print (command)
#command = os.system(command)
command = subprocess.Popen(
["/usr/local/bin/doveadm sync -u %s -d -N -l 30 -U" %user.split('
')[0]
], \
shell = True, stdin=None,
stdout=None,
stderr=None, close_fds=True)
print ( 'Processed Mailbox for
... %s'
%user.split(' ')[0] )
synced.append('Processed Mailbox
for ...
%s' %user.split(' ')[0])
#sys.exit()
break
if len(synced) != 0 :
#send email showing bad synced boxes ?
if options.send_to != None :
send_from = 'moni...@scom.ca'
send_to = ['%s' %options.send_to]
send_subject = 'Dovecot Bad Sync Report for : %s'
%(socket.gethostname())
send_text = '\n\n'
for n in range (len(synced)) :
send_text = send_text + synced[n] + '\n'
send_files = []
sendmail (send_from, send_to, send_subject,
send_text,
send_files)
sys.exit()
second :
i posted this a month ago - no response
please appreciate that i am trying to help ....
after much testing i can now reporduce the replication issues at hand
I am running on freebsd 12 & 13 stable (both test and production
servers)
sdram drives etc ...
Basically replication works fine until reaching a folder quantity
of ~
256 or more
to reproduce using doveadm i created folders like
INBOX/folder-0
INBOX/folder-1
INBOX/folder-2
INBOX/folder-3
and so forth ......
I created 200 folders and they replicated ok on both servers
I created another 200 (400 total) and the replicator got stuck and
would
not update the mbox on the alternate server anymore and is still
updating 4 days later ?
basically replicator goes so far and either hangs or more likely
bails
on an error that is not reported to the debug reporting ?
however dsync will sync the two servers but only when run manually
(ie
all the folders will sync)
I have two test servers avaliable if you need any kind of access -
again
here to help.
[07:28:42] mail18.scom.ca [root:0] ~
# sync.status
Queued 'sync' requests 0
Queued 'high' requests 6
Queued 'low' requests 0
Queued 'failed' requests 0
Queued 'full resync' requests 0
Waiting 'failed' requests 0
Total number of known users 255
username type status
p...@scom.ca normal Waiting for dsync to
finish
ke...@elirpa.com incremental Waiting for dsync to
finish
ed.ha...@dssmgmt.com incremental Waiting for dsync to
finish
e...@scom.ca incremental Waiting for dsync to
finish
n...@elirpa.com incremental Waiting for dsync to
finish
p...@paulkudla.net incremental Waiting for dsync to
finish
i have been going through the c code and it seems the replication
gets
requested ok
replicator.db does get updated ok with the replicated request for the
mbox in question.
however i am still looking for the actual replicator function in the
lib's that do the actual replication requests
the number of folders & subfolders is defanately the issue - not the
mbox pyhsical size as thought origionally.
if someone can point me in the right direction, it seems either the
replicator is not picking up on the number of folders to replicat
properly or it has a hard set limit like 256 / 512 / 65535 etc and
stops
the replication request thereafter.
I am mainly a machine code programmer from the 80's and have
concentrated on python as of late, 'c' i am starting to go through
just
to give you a background on my talents.
It took 2 months to finger this out.
this issue also seems to be indirectly causing the duplicate messages
supression not to work as well.
python programming to reproduce issue (loops are for last run
started @
200 - fyi) :
# cat mbox.gen
#!/usr/local/bin/python2
import os,sys
from lib import *
user = 'p...@paulkudla.net'
"""
for count in range (0,600) :
box = 'INBOX/folder-%s' %count
print count
command = '/usr/local/bin/doveadm mailbox create -s -u %s
%s'
%(user,box)
print command
a = commands.getoutput(command)
print a
"""
for count in range (0,600) :
box = 'INBOX/folder-0/sub-%' %count
print count
command = '/usr/local/bin/doveadm mailbox create -s -u %s
%s'
%(user,box)
print command
a = commands.getoutput(command)
print a
#sys.exit()
Happy Sunday !!!
Thanks - paul
Paul Kudla
Scom.ca Internet Services <http://www.scom.ca>
004-1009 Byron Street South
Whitby, Ontario - Canada
L1N 4S3
Toronto 416.642.7266
Main 1.866.411.7266
Fax 1.888.892.7266
On 4/24/2022 10:22 AM, Arnaud Abélard wrote:
Hello,
I am working on replicating a server (and adding compression on the
other side) and since I had "Error: dsync I/O has stalled, no
activity
for 600 seconds (version not received)" errors I upgraded both
source
and destination server with the latest 2.3 version (2.3.18). While
before the upgrade all the 15 replication connections were busy
after
upgrading dovecot replicator dsync-status shows that most of the
time
nothing is being replicated at all. I can see some brief
replications
that last, but 99,9% of the time nothing is happening at all.
I have a replication_full_sync_interval of 12 hours but I have
thousands of users with their last full sync over 90 hours ago.
"doveadm replicator status" also shows that i have over 35,000
queued
full resync requests, but no sync, high or low queued requests so
why
aren't the full requests occuring?
There are no errors in the logs.
Thanks,
Arnaud