Wietse,

Understood. Thank you for the clear correction. I had assumed "ms" was a valid time(5) suffix and it propagated into our internal documentation. That was on me, not the docs.

So our 2026-05-04 incident is now fully explained: setting an invalid "500ms" at smtp_destination_rate_delay was accepted at startup (per-transport variant defers validation), then fatal-exited qmgr at the first SMTP delivery attempt, and the qmgr restart cascaded to stall every other transport. Heavier queue triggered the first delivery sooner, which is why mx1 (busy) reproduced and mx2/dev (idle) didn't. There is no separate "Bug B": it was Bug A wearing a load-dependent disguise.

Our updated tuning rules:

- Minimum rate_delay is 1s. Sub-second is architecturally impossible at the rate_delay knob, full stop.
- Always set time values at default_destination_* (qmgr fatals loudly at startup if syntax is bad).
- Never tune time values at smtp_destination_* / relay_destination_* / per-transport (validation deferred to first invocation = potential silent failure under load).

For higher throughput we'll scale concurrency or add nodes horizontally, the right levers in the first place.

Thank you again for the patience walking us through this. The operational rules above are locked into our internal documentation.

Yoda
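The syntax rule quoted further down in this thread (an integral value plus an optional one-letter suffix from s, m, h, d, w) is simple enough to check before a change reaches a production host. Below is a minimal sketch in Python of such a pre-deploy check. It is not part of Postfix: the function names `parse_time_value` and `check_rate_delay` are hypothetical, the sketch assumes lower-case suffixes only, and the per-transport warning encodes the local tuning rules from this message rather than any Postfix behaviour.

```python
import re

# Accepted form, per the documentation quoted in this thread: an integral
# value plus an optional one-letter unit suffix (s, m, h, d, w).
# Assumption: lower-case suffixes only; this sketch is deliberately strict.
_TIME_RE = re.compile(r"[0-9]+[smhdw]?")
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def parse_time_value(text):
    """Return the value in seconds, or None if the syntax is not accepted."""
    text = text.strip()
    if _TIME_RE.fullmatch(text) is None:
        return None
    if text[-1] in _UNIT_SECONDS:
        return int(text[:-1]) * _UNIT_SECONDS[text[-1]]
    # No suffix: the default time unit is seconds.
    return int(text)


def check_rate_delay(name, value):
    """Hypothetical pre-deploy check: list the problems found for one setting."""
    problems = []
    if parse_time_value(value) is None:
        problems.append(f"{name} = {value}: not a valid time value")
    # Local rule from this message, not a Postfix restriction: keep time
    # values at the default_destination_* level only.
    if name.endswith("_destination_rate_delay") and not name.startswith("default_"):
        problems.append(f"{name}: per-transport time value, set the default_ variant instead")
    return problems
```

With this sketch, `parse_time_value("500ms")` returns `None`, while `check_rate_delay("default_destination_rate_delay", "1s")` returns an empty list. Running something like it over the output of `postconf -n` before a reload would have flagged the 2026-05-04 change on both counts.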


On 5/5/26 9:29 AM, Wietse Venema via Postfix-users wrote:
Yoda via Postfix-users:
  One question I want to ask separately from the syntax case before we
  lock in our tuning rules:

  In the 2026-05-04 incident, the mx1 stall occurred after a batched
  postconf change that included:

      default_destination_rate_delay      = 0
      smtp_destination_rate_delay         = 500ms
      default_destination_concurrency_limit = 8
      default_destination_recipient_limit   = 100

  "500ms" is valid integer + ms suffix per time(5), so I would not

500ms IS NOT valid Postfix syntax.

As documented:

       To enable the delay, specify a non-zero time value  (an  integral  value
       plus an optional one-letter suffix that specifies the time unit).

       Time  units:  s  (seconds), m (minutes), h (hours), d (days), w (weeks).
       The default time unit is s (seconds).

    Under sustained load with many active recipient destinations
    (hundreds) and a backed-up queue, is sub-second
    default_destination_rate_delay (e.g. "500ms", valid integer + ms
    syntax, set at the default level only) safe?

It is an INVALID configuration. It is therefore not safe from a
mail delivery performance point of view.

	Wietse
_______________________________________________
Postfix-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]