[pfx] Re: qmgr stalls when default_destination_rate_delay drops below 1s on a node, with a backed-up queue

Yoda via Postfix-users Tue, 05 May 2026 06:17:12 -0700

Hi Wietse,                                                      
                                                                                                                                                      
  Thank you for the detailed follow-up. The fatal-at-first-invocation +
  restart-cascading-to-all-transports mechanic is exactly the piece I
  was missing — it accounts for the load-dependent appearance, the
  "0 smtp delivery processes" symptom, and why mx2 and dev never
  enough to trigger the fatal).                                  
                               
  One question I want to ask separately from the syntax case before we
  lock in our tuning rules:                                           
                           
  In the 2026-05-04 incident, the mx1 stall occurred after a batched
  postconf change that included:                                    
                                
      default_destination_rate_delay      = 0
      smtp_destination_rate_delay         = 500ms
      default_destination_concurrency_limit = 8  
      default_destination_recipient_limit   = 100                                                                                                     
                                                 
  "500ms" is valid integer + ms suffix per time(5), so I would not                                                                                    
  expect a syntax fatal there. We rolled back to baseline (5/1s/50)                                                                                   
  without capturing a clean `postconf -n` during the incident, so I
  cannot prove what was actually applied at the moment of the stall.                                                                                  
  There may have been an intermediate state I'm not aware of.                                                                                         
                                                             
  The question, separated from Bug A:                                                                                                                 
                                                                                                                                                      
    Under sustained load with many active recipient destinations
    (hundreds) and a backed-up queue, is sub-second                                                                                                   
    default_destination_rate_delay (e.g. "500ms", valid integer + ms
    syntax, set at the default level only) safe? Or is there a                                                                                        
    load-dependent interaction with the queue manager's per-destination
    scheduling state that could cause similar stall-looking behavior                                                                                  
    independent of syntax fatals?                                   
                                                                                                                                                      
  Put differently: is "Bug B" actually a separate phenomenon, or is
  "Bug B" just "Bug A" that I misidentified — i.e., what I observed                                                                                   
  was a transient invalid-syntax state I didn't capture, rather than
  valid 500ms misbehaving under load?                                                                                                                 
                                                                                                                                                      
  For shape context:                                                                                                                                  
                                                                                                                                                      
    - Postfix 3.7.11 on Debian 12                                                                                                                     
    - Inbound MTA filtering for SaaS customers (anti-spam)                                                                                            
    - Outbound delivery to customer M365 connectors + arbitrary
      external recipients                                                                                                                             
    - Busy node: ~100k inbound emails/day, mixed outbound destinations
    - Current baseline (5 / 1s / 50, cohort 10) is working fine; we'd 
      just like to know whether tighter sub-second tuning is achievable                                                                               
      for M365-heavy throughput, or whether 1s is the safe floor.                                                                                     
                                                                                                                                                      
  If sub-second is safe, the operational rule "set time values only at                                                                                
  default_destination_* and never at smtp_destination_*" is the       
  backstop — Bug A could never reach production because qmgr would                                                                                    
  fatal at startup, not silently mid-load.                        
                                                                                                                                                      
  If sub-second under load has its own pitfalls independent of syntax,                                                                                
  we'd rather know now and stay at 1s.                                                                                                                
                                                                                                                                                      
  Thanks again for the patience walking through this thread. The                                                                                      
  discipline rule "never tune time values at smtp_destination_*" is
  locked into our internal documentation.                                                                                                             
                                                                  
      Yoda 

On 5/4/26 3:10 PM, Wietse Venema via Postfix-users wrote:

Wietse Venema via Postfix-users:

Yoda via Postfix-users:

 What I'd like to understand:
      
 1. Is sub-second default_destination_rate_delay safe to use under
    sustained load on a queue that already has tens-to-hundreds of
    active recipient destinations? Or is there a load-dependent
    interaction with qmgr's per-destination scheduling state

As documented at https://www.postfix.org/postconf.5.html#default_destination_rate_delay:


   To enable the delay, specify a non-zero time value  (an  integral  value
   plus an optional one-letter suffix that specifies the time unit).

When I set the system-wide rate delay:

    # postconf default_destination_rate_delay=0.5s
    # postfix reload

The queue manager logs a fatal error:

    May  4 14:02:01 wzv postfix/qmgr[2216906]: fatal: parameter
	default_destination_rate_delay: bad time value or unit: 0.5s

For some reason this integer constraint is not enforced for
smtp_destination_rate_delay, relay_destination_rate_delay, and so
on, meaning that the parameter value may not be used.

A transport-specific rate delay is enforced later, when the
delivery transport is first used.

With this configuration:

    # postconf -n|grep rate_delay
    smtp_destination_rate_delay = 0.5

The queue manager will not immediately complain after "postfix
start" or "postfix reload".

It will complain at the first attempt to deliver mail using the
'smtp' delivery transport:

    May  4 14:49:27 wzv postfix/qmgr[2224456]: 4g8Vy753yNzcsjZ:
	from=<[email protected]>, size=279, nrcpt=1 (queue active)

    May  4 14:49:27 wzv postfix/qmgr[2224456]: fatal: parameter
	smtp_destination_rate_delay: bad time value or unit: 0.5

Depending on your email mix, some time may pass between queue manager
startup or restart, and the first time that a specific delivery
transport is used.
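
Because a transport-specific value is only rejected at first use, it can help to check the values before a reload. The following is a minimal sketch, not part of Postfix: it scans text in the format printed by "postconf -n" and reports every *_destination_rate_delay setting that is not an integer with an optional one-letter unit. The function name and the accepted unit letters (lowercase s, m, h, d, w) are assumptions based on the postconf(5) wording quoted above, not taken from the Postfix source.

```python
import re

# Assumption: a valid time value is a non-negative integer plus an optional
# one-letter unit, per the postconf(5) text quoted above. The unit set
# (lowercase s, m, h, d, w) is an assumption, not read from Postfix code.
TIME_VALUE = re.compile(r"[0-9]+[smhdw]?")

# Matches "name = value" lines for any <transport>_destination_rate_delay,
# including default_destination_rate_delay.
RATE_DELAY = re.compile(r"^\s*(\w+_destination_rate_delay)\s*=\s*(.*?)\s*$")


def suspicious_rate_delays(postconf_output):
    """Return (name, value) pairs whose value does not match TIME_VALUE."""
    bad = []
    for line in postconf_output.splitlines():
        m = RATE_DELAY.match(line)
        if m and not TIME_VALUE.fullmatch(m.group(2)):
            bad.append((m.group(1), m.group(2)))
    return bad
```

Given the single line "smtp_destination_rate_delay = 0.5" from the example above, this returns [("smtp_destination_rate_delay", "0.5")]. Under the same one-letter-suffix assumption, a value such as "500ms" would also be reported, since "ms" is two letters.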

When the queue manager logs a fatal error, there will be a delay
before the queue manager is restarted. That delay will also affect
deliveries using other delivery transports, so that they appear to
stall.

Lesson learned: look for warning/fatal/panic messages in the log.
https://www.postfix.org/DEBUG_README.html#logging
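
As an illustration of that log check, here is a small sketch that picks warning, fatal and panic records out of syslog-style lines. The function name and the regular expression are assumptions for this example; it expects each record on a single line, as in a real log file (the examples in this message are wrapped by the mail client).

```python
import re

# Matches records such as
#   "May  4 14:49:27 wzv postfix/qmgr[2224456]: fatal: parameter ..."
# Group 1: program tag, group 2: severity, group 3: message text.
PROBLEM = re.compile(r"(postfix\S*)\[\d+\]: (warning|fatal|panic): (.*)")


def problem_lines(lines):
    """Return (program, severity, message) for each matching log line."""
    hits = []
    for line in lines:
        m = PROBLEM.search(line)
        if m:
            hits.append((m.group(1), m.group(2), m.group(3)))
    return hits
```

Ordinary delivery records, such as the "(queue active)" line above, are skipped; only lines whose text after the process ID starts with one of the three severities are returned.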

	Wietse
_______________________________________________
Postfix-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
