On 21.08.24 11:35, Timo Sirainen wrote:
[Lots and lots of "but my NTP sync is much more precise than that" in
the FreeBSD thread]
> The way Dovecot works is:
> - It finds the next timeout, sees that it happens in e.g. 5 milliseconds.
> - Then it calls kqueue() to wait for I/O for max 5 milliseconds
> - Then it notices that it actually returned more than 105 milliseconds
> later, and then logs a warning about it.
I think that more information is needed to pinpoint possible causes, and
one of the open questions is: What clock does Dovecot look at to
determine how long it *actually* stayed dormant? On Linux, software that
needs a monotonically increasing "time" to derive guaranteed-unique IDs
from often looks at the kernel uptime - which is essentially a count of
ticks since bootup and is *not* corrected by NTP.
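(For comparison: on Linux, CLOCK_REALTIME is the NTP-disciplined wall
clock, while CLOCK_MONOTONIC never jumps when the wall clock is stepped.
A quick, untested sketch - plain POSIX, not Dovecot code - to watch the
wall clock move relative to the monotonic clock:

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    /* Offset between the NTP-disciplined wall clock and the monotonic
     * clock.  If this offset changes between two samples, the wall
     * clock was stepped (or slewed) in between. */
    static double wall_minus_mono(void)
    {
        struct timespec rt, mo;
        clock_gettime(CLOCK_REALTIME,  &rt);
        clock_gettime(CLOCK_MONOTONIC, &mo);
        return (rt.tv_sec - mo.tv_sec) + (rt.tv_nsec - mo.tv_nsec) / 1e9;
    }

    int main(void)
    {
        double before = wall_minus_mono();
        sleep(10);                  /* window for NTP to do something */
        double after  = wall_minus_mono();
        printf("wall clock moved %+.6f s against the monotonic clock\n",
               after - before);
        return 0;
    }

If Dovecot measures its "dormant" time on the former and the machine's
NTP client moves the wall clock in the meantime, the measurement is off
by exactly that movement.)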
Similarly, it should be determined whether the timeout of the I/O
function called (i.e., kqueue()) is or isn't influenced by NTP's
corrections to the system time.
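One way to check that on an affected machine - a minimal, untested
sketch against the plain kqueue API, not Dovecot's actual ioloop code -
is to do an idle kevent() wait with a short timeout and compare the
elapsed time on both clocks:

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>
    #include <stdio.h>
    #include <time.h>

    /* Wait on an empty kqueue with a 5 ms timeout - the idle case from
     * the loop described above - and measure how long the wait really
     * took, on both the wall clock and the monotonic clock.
     * (FreeBSD/macOS; error handling omitted.) */
    int main(void)
    {
        int kq = kqueue();
        struct kevent ev;
        struct timespec tmo = { 0, 5 * 1000 * 1000 };   /* 5 ms */
        struct timespec rt0, rt1, mo0, mo1;

        clock_gettime(CLOCK_REALTIME,  &rt0);
        clock_gettime(CLOCK_MONOTONIC, &mo0);
        (void)kevent(kq, NULL, 0, &ev, 1, &tmo);  /* no events: pure timeout */
        clock_gettime(CLOCK_REALTIME,  &rt1);
        clock_gettime(CLOCK_MONOTONIC, &mo1);

        printf("realtime:  %.6f s\n",
               (rt1.tv_sec - rt0.tv_sec) + (rt1.tv_nsec - rt0.tv_nsec) / 1e9);
        printf("monotonic: %.6f s\n",
               (mo1.tv_sec - mo0.tv_sec) + (mo1.tv_nsec - mo0.tv_nsec) / 1e9);
        return 0;
    }

Running that in a loop while the NTP client is (and isn't) allowed to
touch the clock should show on which of the two timescales the kernel
accounts the timeout.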
The third piece of information I'd like to have is which client software
provides that NTP sync to the machine: ntpd, chronyd, something else?
(As an example of why this is relevant: Several hundred deviations of
100 ms or more per day sum up to tens of seconds per day - provided they
all go in the same direction - or, expressed as a frequency error, to
several times 115 ppm. ntpd refuses to do *slews* correcting by more
than 500 ppm; if the OS clock's frequency error exceeds that, ntpd would
need to do *steps* every now and then, and in a default configuration,
an ntpd will refuse to do a *second* step and *die* instead. Or, if the
reference clock sways *back and forth*, ntpd should very likely complain
about its sources' jitter in the logs. chronyd, however, is more
ruthless in whacking the local clock into "sync" with the external
sources, and much more inclined to define "sync" as "low difference"
rather than also taking frequency stability into account like ntpd
does.)
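To spell out the arithmetic: 100 deviations/day x 0.1 s = 10 s/day, and
10 s / 86400 s ≈ 116 ppm; three or four hundred such deviations per day
therefore already approach ntpd's 500 ppm slew limit (500 ppm x 86400 s
≈ 43 s/day).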
> Also, this is kind of a problem when it does happen. Since Dovecot
> thinks the time moved e.g. 100ms forward, it adjusts all timeouts to
> happen 100ms backwards. If this wasn't a true time jump, then these
> timeouts now happen 100ms earlier.
That is, of course, a dangerous approach if you do *not* have a
guarantee that the timeout of the I/O function called is *otherwise*
true to the requested duration. But shouldn't those other
concurrently-running timeouts notice an actual discontinuity of the
timescale just the same as the first one did? Maybe some sort of "N
'nay's needed for a vote of no confidence" mechanism would be safer ...
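Purely to sketch that idea (hypothetical code, nothing of the sort
exists in Dovecot as far as I know; the names and the threshold of three
confirmations are made up):

    /* Only accept a "time moved forwards" verdict after several
     * consecutive wakeups have each come back at least 100 ms later
     * than requested; a single outlier is treated as scheduling noise
     * and the timeouts are left alone. */
    #define CONFIRMATIONS_NEEDED 3

    static unsigned int suspect_count;

    /* Called once per event-loop wakeup with the delay (in ms) beyond
     * what the I/O wait was asked for.  Returns 1 if the loop should
     * treat the delay as a real clock jump and shift its timeouts. */
    static int confirmed_time_jump(long excess_ms)
    {
        if (excess_ms < 100) {          /* below the warning threshold */
            suspect_count = 0;
            return 0;
        }
        if (++suspect_count < CONFIRMATIONS_NEEDED)
            return 0;                   /* a lone "nay" is not a vote */
        suspect_count = 0;
        return 1;
    }

The obvious downside is that a *real* step would then get applied to the
timeouts only a couple of wakeups late.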
Kind regards,
--
Jochen Bern
Systems Engineer
Binect GmbH