Wietse,
Understood. Thank you for the clear correction — I had assumed "ms"
was a valid time(5) suffix and it propagated into our internal
documentation. That was on me, not the docs.
So our 2026-05-04 incident is now fully explained: the invalid value
"500ms" for smtp_destination_rate_delay was accepted at startup (the
per-transport variant defers validation), then made qmgr exit with a
fatal error at the first SMTP delivery attempt, and the resulting qmgr
restart cascaded into a stall of every other transport. A heavier
queue triggers the first delivery sooner, which is why mx1 (busy)
reproduced the stall and mx2/dev (idle) didn't. There is no separate
"Bug B": it was Bug A wearing a load-dependent disguise.
Our updated tuning rules:
- Minimum non-zero rate_delay is 1s. A time value is an integer plus
  an optional one-letter unit (s, m, h, d, w), so a sub-second delay
  cannot be expressed at the rate_delay knob at all.
- Always set time values at default_destination_* (qmgr exits with a
  fatal error at startup if the syntax is bad).
- Never tune time values at smtp_destination_* /
  relay_destination_* / any other per-transport variant (validation
  is deferred to first use, so a bad value can fail silently until
  load triggers it).
For higher throughput we'll scale concurrency or add nodes
horizontally — the right levers in the first place.
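
To enforce the rules above before a change goes live, a pre-deploy check along these lines may help. It is a minimal sketch in Python, not Postfix's own parser: the function names are ours (hypothetical), the accepted syntax is taken only from the documentation text quoted below (an integer plus an optional one-letter suffix s, m, h, d, w), and it assumes simple one-line "name = value" settings with no continuation lines or $parameter references.

```python
import re

# Time-value syntax as quoted from the Postfix documentation in this thread:
# an integral value plus an optional one-letter suffix (s, m, h, d, w).
# Assumption: lowercase suffixes only, exactly as listed in that text.
_TIME_RE = re.compile(r"[0-9]+[smhdw]?")


def is_valid_time_value(value: str) -> bool:
    """Return True if value looks like a documented Postfix time value."""
    return _TIME_RE.fullmatch(value.strip()) is not None


def lint_rate_delay(main_cf_text: str) -> list[str]:
    """Collect rule violations for *_destination_rate_delay settings.

    Simplification: every non-comment line containing '=' is treated as a
    complete 'name = value' setting.
    """
    problems = []
    for raw in main_cf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = (part.strip() for part in line.split("=", 1))
        if not name.endswith("_destination_rate_delay"):
            continue
        if not is_valid_time_value(value):
            problems.append(f"{name}: invalid time value {value!r}")
        if not name.startswith("default_"):
            # Our internal rule, not a Postfix restriction.
            problems.append(f"{name}: per-transport override, set default_ instead")
    return problems
```

Fed the settings from the 2026-05-04 change (for example, text in the "name = value" form that postconf -n prints), this should flag smtp_destination_rate_delay twice: once for the "500ms" value and once for being a per-transport override.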
Thank you again for the patience walking us through this. The
operational rules above are locked into our internal documentation.
Yoda

On 5/5/26 9:29 AM, Wietse Venema via Postfix-users wrote:

Yoda via Postfix-users:
> One question I want to ask separately from the syntax case before we
> lock in our tuning rules:
>
> In the 2026-05-04 incident, the mx1 stall occurred after a batched
> postconf change that included:
>
>     default_destination_rate_delay = 0
>     smtp_destination_rate_delay = 500ms
>     default_destination_concurrency_limit = 8
>     default_destination_recipient_limit = 100
>
> "500ms" is valid integer + ms suffix per time(5), so I would not

500ms IS NOT valid Postfix syntax. As documented:

    To enable the delay, specify a non-zero time value (an integral
    value plus an optional one-letter suffix that specifies the time
    unit). Time units: s (seconds), m (minutes), h (hours), d (days),
    w (weeks). The default time unit is s (seconds).

> Under sustained load with many active recipient destinations
> (hundreds) and a backed-up queue, is sub-second
> default_destination_rate_delay (e.g. "500ms", valid integer + ms
> syntax, set at the default level only) safe?

It is an INVALID configuration. It is therefore not safe from a mail
delivery performance point of view.

        Wietse
_______________________________________________
Postfix-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
