> Yes, although the error message could be changed to "locking timed out".
> But at least now the error shouldn't be visible to clients (other than
> small slowdowns due to the 2 second lock wait).
>
> Anyway, the real problem is one of:
>
> a) Dovecot is really locking dovecot.index.cache file for a long time
> for some reason and other processes are timing out because of it.

Almost all cache files are very small, so there is no reason this should take a long time, unless something odd in the cache-building code keeps it in a never-ending state.

> b) Some process is crashing and leaving stale dovecot.index.cache.lock
> files lying around. But that'd have to be a .lock from another server,
> because on the same server Dovecot checks to see if the PID exists and
> if not it'll just override the lock immediately.

That could be more likely. We have 30 servers operating on this spool, so if some of them have crashing processes that leave a .lock behind for the other servers to find, that may cause issues, right? Could it even be from some old Dovecot version?

I checked last week's logs and had almost no crashes: about 100 'killed with signal' log lines out of a few zillion log entries. I'm now running a find for dovecot.index.cache.lock files on our NFS indexes dir.

> c) NFS caching problems: the .lock file was deleted by server1 but
> server2 didn't see that, so it keeps assuming that the file exists long
> after it was really gone.

But what about this: I'm also seeing the same problem if I keep nfs=yes and dotlock on a local filesystem instead of on NFS. That should exclude any multiple-NFS-server issues, right? Or will doing nfs=yes on a local FS give weird results?

I should just move everything to Linux..

Cor