On Tue, 25 Jan 2011 00:13:53 +0200, Timo Sirainen <t...@iki.fi> wrote:
> On 24.1.2011, at 23.06, <l...@airstreamcomm.net> wrote:
> 
>> We tested the patch you suggested with no success.  We are seeing
>> timestamps straying into the 100's of seconds of difference, which does
>> not
>> reflect the perceivable drift that shows on our systems clock (which is
>> negligible, sub 1 second).
> ..
>> Created dotlock file's timestamp is different than current time
>> (1295900137 vs 1295899898): /mail/username/Maildir/dovecot-uidlist
> 
> I see you have:
> 
>> dotlock_use_excl: no
> 
> There was another reason why the above error could happen with that
> setting. http://hg.dovecot.org/dovecot-1.2/rev/ab81fbb195e2 fixes it.
> 
> But anyway, the main problem you have is that Dovecot is waiting for
> hundreds of seconds for the uidlist lock. It shouldn't. Meaning your NFS
> server is answering way too slowly to Dovecot.

Timo,

Thanks for the quick reply.  

Per the dovecot wiki on nfs http://wiki.dovecot.org/NFS:

Multi-server setup that tries to flush NFS caches:

mmap_disable = yes
dotlock_use_excl = no # only needed with NFSv2, NFSv3+ supports O_EXCL and
it's faster
mail_nfs_storage = yes # v1.1+ only
mail_nfs_index = yes # v1.1+ only

We configured our systems to match these settings, as we are trying to use
a multi-server setup.  

We certainly have not excluded our NFS server as the culprit for the long
wait, but at this point we are running two server's against that NFS share,
a postfix machine and a dovecot machine, both accessing the disk without
error or any latency.  It's only when we add a second dovecot machine to
the mix when we start to see the dotlock errors.  In our configuration we
have a load balancer that is distributing traffic automatically to the pop
and imap services on our cluster of machines.  Once we add a new server to
the mix the load balancer automatically directs new traffic, and we start
seeing the errors:

Jan 24 13:20:39 10.123.128.105 dovecot: pop3-login: Login:
user=<mpbixler>, method=PLAIN, rip=68.65.35.174, lip=64.33.128.105
Jan 24 13:20:42 10.123.128.105 dovecot: POP3(mpbixler): Disconnected:
Logged out top=0/0, retr=0/0, del=0/8234, size=1450555980
Jan 24 13:40:46 10.123.128.105 dovecot: pop3-login: Login:
user=<mpbixler>, method=PLAIN, rip=68.65.35.174, lip=64.33.128.105
Jan 24 13:40:47 10.123.128.105 dovecot: POP3(mpbixler): Disconnected:
Logged out top=0/0, retr=0/0, del=0/8234, size=1450555980
Jan 24 14:14:28 10.123.128.108 dovecot: pop3-login: Login:
user=<mpbixler>, method=PLAIN, rip=68.65.35.174, lip=64.33.128.108
Jan 24 14:14:28 10.123.128.108 dovecot: POP3(mpbixler): Created dotlock
file's timestamp is different than current time (1295900068 vs 1295899828):
/mail/m/p/mpbixler/Maildir/dovecot-uidlist
Jan 24 14:14:28 10.123.128.108 dovecot: POP3(mpbixler): Created dotlock
file's timestamp is different than current time (1295900068 vs 1295899828):
/mail/m/p/mpbixler/Maildir/dovecot-uidlist
Jan 24 14:14:32 10.123.128.108 dovecot: POP3(mpbixler): Created dotlock
file's timestamp is different than current time (1295900072 vs 1295899833):
/mail/m/p/mpbixler/Maildir/dovecot-uidlist
Jan 24 14:14:33 10.123.128.108 dovecot: POP3(mpbixler): Disconnected:
Logged out top=0/0, retr=1/7449, del=0/8235, size=1450563412
Jan 24 14:34:36 10.123.128.105 dovecot: pop3-login: Login:
user=<mpbixler>, method=PLAIN, rip=68.65.35.174, lip=64.33.128.105
Jan 24 14:34:40 10.123.128.105 dovecot: POP3(mpbixler): Disconnected:
Logged out top=0/0, retr=1/10774, del=0/8236, size=1450574168
Jan 24 14:54:53 10.123.128.105 dovecot: pop3-login: Login:
user=<mpbixler>, method=PLAIN, rip=68.65.35.174, lip=64.33.128.105
Jan 24 14:54:54 10.123.128.105 dovecot: POP3(mpbixler): Disconnected:
Logged out top=0/0, retr=0/0, del=0/8236, size=1450574168


As you can see here, our current mail server 10.123.128.105 is handling
the traffic without incident, and once we add 10.123.128.108 to the cluster
the dovecot service errors out.  From the last time the user was logged in
at 13:40:47 to the login on the new server at 14:14:28 is about 34 minutes,
which from my understanding of the code is not possible for a lock file to
exist that long, and does not reflect in the time stamps in the error log. 


Pretty stumped at this point, as our NFS server appears to be running
smoothly.  All the packet captures we have run show that all the NFS
related traffic is moving back and forth quickly and without error.  

Thanks,

Michael

Reply via email to