Hi Wietse,
Thank you for the detailed follow-up. The fatal-at-first-invocation +
restart-cascading-to-all-transports mechanic is exactly the piece I
was missing — it accounts for the load-dependent appearance, the
"0 smtp delivery processes" symptom, and why mx2 and dev never
showed it (their traffic was presumably never enough to trigger the
fatal).
One question I want to ask separately from the syntax case before we
lock in our tuning rules:
In the 2026-05-04 incident, the mx1 stall occurred after a batched
postconf change that included:
    default_destination_rate_delay = 0
    smtp_destination_rate_delay = 500ms
    default_destination_concurrency_limit = 8
    default_destination_recipient_limit = 100
"500ms" is valid integer + ms suffix per time(5), so I would not
expect a syntax fatal there. We rolled back to baseline (5/1s/50)
without capturing a clean `postconf -n` during the incident, so I
cannot prove what was actually applied at the moment of the stall.
There may have been an intermediate state I'm not aware of.
The question, separated from Bug A:
Under sustained load with many active recipient destinations
(hundreds) and a backed-up queue, is sub-second
default_destination_rate_delay (e.g. "500ms", valid integer + ms
syntax, set at the default level only) safe? Or is there a
load-dependent interaction with the queue manager's per-destination
scheduling state that could cause similar stall-looking behavior
independent of syntax fatals?
Put differently: is "Bug B" actually a separate phenomenon, or is
"Bug B" just "Bug A" that I misidentified — i.e., what I observed
was a transient invalid-syntax state I didn't capture, rather than
valid 500ms misbehaving under load?
For shape context:
- Postfix 3.7.11 on Debian 12
- Inbound MTA filtering for SaaS customers (anti-spam)
- Outbound delivery to customer M365 connectors + arbitrary
  external recipients
- Busy node: ~100k inbound emails/day, mixed outbound destinations
- Current baseline (5 / 1s / 50, cohort 10) is working fine; we'd
  just like to know whether tighter sub-second tuning is achievable
  for M365-heavy throughput, or whether 1s is the safe floor.
If sub-second is safe, the operational rule "set time values only at
default_destination_* and never at smtp_destination_*" is the
backstop — Bug A could never reach production because qmgr would
fatal at startup, not silently mid-load.
If sub-second under load has its own pitfalls independent of syntax,
we'd rather know now and stay at 1s.
Thanks again for the patience walking through this thread. The
discipline rule "never tune time values at smtp_destination_*" is
locked into our internal documentation.
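A minimal sketch of a pre-reload check built on the documented rule
quoted below (an integral value plus an optional one-letter suffix).
The unit letters s, m, h, d, w are taken from postconf(5); the function
names are hypothetical, and this is not Postfix's own parser, so it
only approximates what qmgr will accept:

```python
import re

# Assumption: unit letters s, m, h, d, w as listed in postconf(5).
# This illustrates the documented rule; it is not Postfix's parser.
_TIME_VALUE = re.compile(r"[0-9]+[smhdw]?")


def is_documented_time_value(value: str) -> bool:
    """Return True if value is an integer with an optional one-letter unit."""
    return _TIME_VALUE.fullmatch(value.strip()) is not None


def check_rate_delays(settings: dict[str, str]) -> list[str]:
    """Return names of *_destination_rate_delay settings that fail the check."""
    return [
        name
        for name, value in settings.items()
        if name.endswith("_destination_rate_delay")
        and not is_documented_time_value(value)
    ]


if __name__ == "__main__":
    # Hypothetical candidate change, using values that appear in this thread.
    candidate = {
        "default_destination_rate_delay": "1s",
        "smtp_destination_rate_delay": "0.5",
    }
    print(check_rate_delays(candidate))
```

Running something like this against a candidate change before
"postfix reload" would flag a transport-level value such as "0.5"
up front, rather than at the first smtp delivery.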
Yoda
On 5/4/26 3:10 PM, Wietse Venema via Postfix-users wrote:
> Wietse Venema via Postfix-users:
> > Yoda via Postfix-users:
> > > What I'd like to understand:
> > >
> > > 1. Is sub-second default_destination_rate_delay safe to use under
> > >    sustained load on a queue that already has tens-to-hundreds of
> > >    active recipient destinations? Or is there a load-dependent
> > >    interaction with qmgr's per-destination scheduling state
> >
> > As documented
> > https://www.postfix.org/postconf.5.html#default_destination_rate_delay
> >
> >     To enable the delay, specify a non-zero time value (an integral
> >     value plus an optional one-letter suffix that specifies the
> >     time unit).
> >
> > When I set the system-wide rate delay:
> >
> >     # postconf default_destination_rate_delay=0.5s
> >     # postfix reload
> >
> > The queue manager logs a fatal error:
> >
> >     May 4 14:02:01 wzv postfix/qmgr[2216906]: fatal: parameter default_destination_rate_delay: bad time value or unit: 0.5s
> >
> > For some reason this integer constraint is not enforced for
> > smtp_destination_rate_delay, relay_destination_rate_delay, and so
> > on, meaning that the parameter value may not be used.
>
> A transport-specific rate delay is enforced later, when the delivery
> transport is first used. With this configuration:
>
>     # postconf -n|grep rate_delay
>     smtp_destination_rate_delay = 0.5
>
> The queue manager will not immediately complain after "postfix start"
> or "postfix reload". It will complain at the first attempt to deliver
> mail using the 'smtp' delivery transport:
>
>     May 4 14:49:27 wzv postfix/qmgr[2224456]: 4g8Vy753yNzcsjZ: from=<[email protected]>, size=279, nrcpt=1 (queue active)
>     May 4 14:49:27 wzv postfix/qmgr[2224456]: fatal: parameter smtp_destination_rate_delay: bad time value or unit: 0.5
>
> Depending on your email mix, some time may pass between queue manager
> startup or restart, and the first time that a specific delivery
> transport is used.
>
> When the queue manager logs a fatal error, there will be a delay
> before the queue manager is restarted. That delay will also affect
> deliveries using other delivery transports, so that they appear to
> stall.
>
> Lesson learned: look for warning/fatal/panic messages in the log.
> https://www.postfix.org/DEBUG_README.html#logging
>
>         Wietse
> _______________________________________________
> Postfix-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
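Following the "look for warning/fatal/panic messages" advice above, a
minimal sketch of a log scan. The regular expression assumes the syslog
line shape shown in the quoted examples; the function name and the
placeholder sender address are hypothetical:

```python
import re

# Matches Postfix daemon log lines whose message starts with a severity
# keyword, e.g. "postfix/qmgr[2216906]: fatal: ...".
# Assumption: lines follow the syslog shape shown in the quoted examples.
_SEVERITY = re.compile(r"postfix\S*\[\d+\]: (warning|fatal|panic): (.*)")


def find_problems(lines):
    """Yield (severity, message) for each warning/fatal/panic log line."""
    for line in lines:
        match = _SEVERITY.search(line)
        if match:
            yield match.group(1), match.group(2)


if __name__ == "__main__":
    # Sample lines modeled on the thread; the sender address is a placeholder.
    sample = [
        "May 4 14:49:27 wzv postfix/qmgr[2224456]: 4g8Vy753yNzcsjZ: "
        "from=<sender@example.com>, size=279, nrcpt=1 (queue active)",
        "May 4 14:49:27 wzv postfix/qmgr[2224456]: fatal: parameter "
        "smtp_destination_rate_delay: bad time value or unit: 0.5",
    ]
    for severity, message in find_problems(sample):
        print(severity, message)
```

Feeding it the mail log lines from around a stall would surface a qmgr
fatal like the one above even when other transports merely look idle.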
