Hi Wietse,

Thank you for the detailed follow-up. The fatal-at-first-invocation + restart-cascading-to-all-transports mechanic is exactly the piece I was missing: it accounts for the load-dependent appearance, the "0 smtp delivery processes" symptom, and why mx2 and dev never enough to trigger the fatal).

One question I want to ask separately from the syntax case before we lock in our tuning rules:

In the 2026-05-04 incident, the mx1 stall occurred after a batched postconf change that included:

    default_destination_rate_delay = 0
    smtp_destination_rate_delay = 500ms
    default_destination_concurrency_limit = 8
    default_destination_recipient_limit = 100

"500ms" is valid integer + ms suffix per time(5), so I would not expect a syntax fatal there. We rolled back to baseline (5/1s/50) without capturing a clean `postconf -n` during the incident, so I cannot prove what was actually applied at the moment of the stall. There may have been an intermediate state I'm not aware of.

The question, separated from Bug A:

Under sustained load with many active recipient destinations (hundreds) and a backed-up queue, is sub-second default_destination_rate_delay (e.g. "500ms", valid integer + ms syntax, set at the default level only) safe? Or is there a load-dependent interaction with the queue manager's per-destination scheduling state that could cause similar stall-looking behavior independent of syntax fatals?

Put differently: is "Bug B" actually a separate phenomenon, or is "Bug B" just "Bug A" that I misidentified, i.e., what I observed was a transient invalid-syntax state I didn't capture, rather than valid 500ms misbehaving under load?

For shape context:

- Postfix 3.7.11 on Debian 12
- Inbound MTA filtering for SaaS customers (anti-spam)
- Outbound delivery to customer M365 connectors + arbitrary external recipients
- Busy node: ~100k inbound emails/day, mixed outbound destinations
- Current baseline (5 / 1s / 50, cohort 10) is working fine; we'd just like to know whether tighter sub-second tuning is achievable for M365-heavy throughput, or whether 1s is the safe floor.

If sub-second is safe, the operational rule "set time values only at default_destination_* and never at smtp_destination_*" is the backstop: Bug A could never reach production because qmgr would fatal at startup, not silently mid-load. If sub-second under load has its own pitfalls independent of syntax, we'd rather know now and stay at 1s.

Thanks again for the patience walking through this thread. The discipline rule "never tune time values at smtp_destination_*" is locked into our internal documentation.

Yoda


On 5/4/26 3:10 PM, Wietse Venema via Postfix-users wrote:
Wietse Venema via Postfix-users:
Yoda via Postfix-users:
 What I'd like to understand:
      
 1. Is sub-second default_destination_rate_delay safe to use under
    sustained load on a queue that already has tens-to-hundreds of
    active recipient destinations? Or is there a load-dependent
    interaction with qmgr's per-destination scheduling state
As documented at https://www.postfix.org/postconf.5.html#default_destination_rate_delay

   To enable the delay, specify a non-zero time value  (an  integral  value
   plus an optional one-letter suffix that specifies the time unit).

When I set the system-wide rate delay:

    # postconf default_destination_rate_delay=0.5s
    # postfix reload

The queue manager logs a fatal error:

    May  4 14:02:01 wzv postfix/qmgr[2216906]: fatal: parameter
	default_destination_rate_delay: bad time value or unit: 0.5s

For some reason this integer constraint is not enforced for
smtp_destination_rate_delay, relay_destination_rate_delay, and so
on, meaning that the parameter value may not be used.
A transport-specific rate delay is enforced later, when the
delivery transport is first used.

With this configuration:

    # postconf -n|grep rate_delay
    smtp_destination_rate_delay = 0.5

The queue manager will not immediately complain after "postfix
start" or "postfix reload".

It will complain at the first attempt to deliver mail using the
'smtp' delivery transport:

    May  4 14:49:27 wzv postfix/qmgr[2224456]: 4g8Vy753yNzcsjZ:
	from=<[email protected]>, size=279, nrcpt=1 (queue active)

    May  4 14:49:27 wzv postfix/qmgr[2224456]: fatal: parameter
	smtp_destination_rate_delay: bad time value or unit: 0.5
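
Because a transport-specific value is only checked at first use, it may help to
check the configuration by hand right after changing it. The following is a
minimal sketch, not part of Postfix: the function names are hypothetical, and
the accepted suffix letters (s, m, h, d, w) are an assumption layered on the
"integral value plus an optional one-letter suffix" wording quoted above.

```shell
# Hypothetical helper: succeed if the value is an integral number with an
# optional one-letter unit suffix. The letter set s/m/h/d/w is an assumption.
is_integral_time_value() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+[smhdw]?$'
}

# Hypothetical helper: read "name = value" lines (the format printed by
# "postconf -n") on stdin and print every *_destination_rate_delay setting
# whose value does not pass the check above.
find_suspect_rate_delays() {
    while IFS= read -r line; do
        case "$line" in
            *"_destination_rate_delay ="*) ;;
            *) continue ;;
        esac
        name=${line%% =*}     # text before " ="
        value=${line#*= }     # text after "= "
        # Note: a two-letter suffix such as "500ms" is flagged as well, since
        # the pattern follows the one-letter wording; adjust it if your
        # Postfix version documents other units.
        is_integral_time_value "$value" || printf '%s\n' "$name = $value"
    done
}
```

Run as `postconf -n | find_suspect_rate_delays`, the configuration shown above
would print `smtp_destination_rate_delay = 0.5`. No output means that no rate
delay setting looked suspect to this pattern; it does not prove that the queue
manager will accept the value.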

Depending on your email mix, some time may pass between queue manager
startup or restart, and the first time that a specific delivery
transport is used.

When the queue manager logs a fatal error, there will be a delay
before the queue manager is restarted. That delay will also affect
deliveries using other delivery transports, so that they appear to
stall.

Lesson learned: look for warning/fatal/panic messages in the log.
https://www.postfix.org/DEBUG_README.html#logging
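
One way to look for such messages is sketched below. The function name is
hypothetical, the pattern is modeled on the log lines shown earlier in this
message, and the log file location in the usage line is an assumption that
depends on the local syslog or journal setup.

```shell
# Hypothetical helper: print Postfix log lines that carry a warning, error,
# fatal or panic label. Reads the named files, or stdin if none are given.
postfix_problems() {
    grep -E 'postfix/[^ ]*\[[0-9]+\]: (warning|error|fatal|panic):' "$@"
}
```

For example, `postfix_problems /var/log/mail.log` (path assumed) would show the
"bad time value or unit" line from the excerpt above, and would skip the
"queue active" line that precedes it.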

	Wietse
_______________________________________________
Postfix-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
