Hello all,

I just tested this morning: I can confirm that the issue seems to be resolved for me after upgrading both servers from 2.3.7.2 to 2.3.9.

Refs:
* https://dovecot.org/pipermail/dovecot/2019-October/117353.html
* https://dovecot.org/pipermail/dovecot/2019-November/117467.html

No more "I/O has stalled" error messages, and replication works fine now.

Thanks very much to the Dovecot team.

Have a nice day.

Fabien

-----Original Message-----
From: dovecot <dovecot-boun...@dovecot.org> On Behalf Of Piper Andreas via dovecot
Sent: Friday, 6 December 2019 07:10
To: dovecot@dovecot.org
Subject: Re: [2.3.8] possible replication issue

Hello Timo,

upgrading both replicators did the job! Both replicators now run v2.3.9 and replication works fine; all sync jobs that queued up during the upgrade have been processed successfully.

Thanks for the reassurance and all your great work with Dovecot,
Andreas

On 05.12.19 at 13:15, Timo Sirainen via dovecot wrote:
> I think there's a good chance that upgrading both will fix it. The bug
> already existed in old versions, it just wasn't normally triggered.
> Since v2.3.8 this situation is triggered on one dsync side, so the
> v2.3.9 fix needs to be on the other side.
>
>> On 5. Dec 2019, at 8.34, Piper Andreas via dovecot
>> <dovecot@dovecot.org> wrote:
>>
>> Hello,
>>
>> upgrading to 2.3.9 unfortunately does *not* solve this issue:
>>
>> I upgraded one of my replicators from 2.3.7.2 to 2.3.9 and after some
>> seconds replication stopped. The other replicator remained on
>> 2.3.7.2. After downgrading to 2.3.7.2, replication is working fine again.
>>
>> I have not tried upgrading both replicators so far, as this is a live
>> production system. Is there a chance that upgrading both replicators
>> will solve the problem?
>>
>> The machines are running Ubuntu 18.04.
>>
>> Any help is appreciated.
>>
>> Thanks,
>> Andreas
>>
>> On 18.10.19 at 13:52, Carsten Rosenberg via dovecot wrote:
>>> Hi,
>>>
>>> some of our customers have discovered a replication issue after
>>> upgrading from 2.3.7.2 to 2.3.8.
>>>
>>> Running 2.3.8, several replication connections hang until the defined
>>> timeout. So after some seconds there are $replication_max_conns hanging
>>> connections.
>>>
>>> Other replications are running fast and successfully.
>>>
>>> Also, running a doveadm sync tcp:... works fine for all users.
>>>
>>> I can't tell for sure, but I haven't seen mailboxes timing out again and
>>> again. So I would assume it's not related to the mailbox.
>>>
>>> From the logs:
>>>
>>> server1:
>>>
>>> Oct 16 08:29:25 server1 dovecot[5715]: dsync-local(userna...@domain.com)<FXnVDW22pl0tGAAA1cwDxA>: Error: dsync(172.16.0.1): I/O has stalled, no activity for 600 seconds (version not received)
>>> Oct 16 08:29:25 server1 dovecot[5715]: dsync-local(userna...@domain.com)<FXnVDW22pl0tGAAA1cwDxA>: Error: Timeout during state=master_recv_handshake
>>>
>>> server2:
>>>
>>> Oct 16 08:29:25 server2 dovecot[8113]: doveadm: Error: read(server1) failed: EOF (last sent=handshake, last recv=handshake)
>>>
>>> There aren't any additional logs regarding the replication.
>>>
>>> I have tried increasing vsz_limit and reducing replication_max_conns.
>>> Nothing changed.
>>>
>>> --
>>>
>>> Both customers have 10k+ users. So far I couldn't reproduce this on
>>> smaller test systems.
>>>
>>> Both installations were downgraded to 2.3.7.2 to fix the issue for now.
>>>
>>> --
>>>
>>> I've attached a tcpdump showing that the client stops
>>> sending any data after the mailbox_guid table headers.
>>>
>>> Any idea what could be wrong here or how to debug this issue?
>>>
>>> Thanks.
>>>
>>> Carsten Rosenberg