Short term DNS issue causing Postfix to queue messages

2022-10-19 Thread Eric Wilkison
We've got a pool of servers running Postfix.  Each server runs BIND to 
cache DNS queries.  We are running into an issue where DNS queries 
intermittently fail (beyond scope for this discussion).  When lookups for a 
destination fail several times in a row, Postfix starts queueing ALL mail 
that would go to that destination for exactly 5 minutes.

For example: bind, with query logging turned on, shows several of these logs:

Oct 19 11:53:12 hkglppfpool4 named[206415]: client @0x7f32b806b440 
127.0.0.1#53827 (cluster9out.us.messagelabs.com): query failed (SERVFAIL) for 
cluster9out.us.messagelabs.com/IN/A at ../../../bin/named/query.c:8580

At the same time Postfix logs:
Oct 19 11:53:12 hkglppfpool4 postfix/smtp[131030]: 4MspyQ3Fm6z511Sx: 
to=, relay=none, delay=10, delays=0.14/0/10/0, 
dsn=4.4.3, status=deferred (Host or domain name not found. Name service error 
for name=cluster9out.us.messagelabs.com type=A: Host not found, try again)

When this happens Postfix starts deferring ALL mail that should be delivered to 
cluster9out.us.messagelabs.com for exactly 300 seconds.  The named query logs 
show no queries for this hostname during those 5 minutes; Postfix is not even 
attempting the lookup.  After the 5 minutes are up, new messages routed 
to cluster9out.us.messagelabs.com are delivered without being deferred and the 
queued messages begin to go out.

Testing shows that the DNS issue is very short term, lasting for 1 second or 
so.  However, the pool of servers can handle a large number of messages in a 
short time period.  This combination of events amplifies the short-term 
DNS issue into messages queueing for 5 minutes.  We've seen the queues grow 
to over 1000 messages before the 5 minutes are up.  The above is just one 
example; we're seeing these delivery delays for several different hosts.

The correct solution is to fix the underlying DNS issue.  However, until then 
we'd like to mitigate the consequences.  Are there configuration options that 
will:
a) adjust the number of DNS failures before Postfix starts deferring the 
messages 
b) adjust the timeout before Postfix stops queueing messages

Thanks,

Eric Wilkison





Re: Short term DNS issue causing Postfix to queue messages

2022-10-19 Thread Sonic
On Wed, Oct 19, 2022 at 4:12 PM Eric Wilkison wrote:
> Are there configuration options that will
> a) adjust the number of DNS failures before postfix starts deferring the 
> messages
> b) adjust the timeout before postfix stops queueing messages

Take a look at minimal_backoff_time and queue_run_delay; they may help
ameliorate the issue.
300 seconds is the default for many of the main.cf parameters; searching
http://www.postfix.org/postconf.5.html for 300s will find them all for
you.
Of course, as you mentioned, fixing the DNS is the best course.
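
As a rough sketch of that suggestion (the values below are illustrative
assumptions, not taken from this thread), the two parameters could be lowered
in main.cf. Per postconf(5), minimal_backoff_time also limits how long an
unreachable destination is kept in the queue manager's short-term, in-memory
destination status cache, which matches the 300-second pause described above.

```text
# /etc/postfix/main.cf (illustrative values only)
# Default is 300s. Also bounds how long an unreachable destination
# stays in the in-memory destination status cache.
minimal_backoff_time = 60s
# Default is 300s. postconf(5) recommends keeping this less than or
# equal to minimal_backoff_time.
queue_run_delay = 60s
```

Both settings are global, so they change retry behavior for every
destination, not only the one affected by the DNS failures.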


Re: Short term DNS issue causing Postfix to queue messages

2022-10-19 Thread Noel Jones

Please see
http://www.postfix.org/QSHAPE_README.html
http://www.postfix.org/TUNING_README.html#hammer

With particular attention to the section:
http://www.postfix.org/QSHAPE_README.html#backlog

Setting up a custom transport for that destination with a high 
destination concurrency limit and a high failed cohort limit will 
likely reduce the pain of these temporary errors.
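
A minimal sketch of such a transport might look like the following. The
transport name "smtp-bursty", the domain example.com, and both numeric values
are hypothetical placeholders; only the parameter name patterns
(transport_destination_concurrency_limit and
transport_destination_concurrency_failed_cohort_limit) come from postconf(5).

```text
# /etc/postfix/master.cf: a dedicated clone of the smtp delivery agent
smtp-bursty  unix  -  -  n  -  -  smtp

# /etc/postfix/main.cf (illustrative values only)
transport_maps = hash:/etc/postfix/transport
smtp-bursty_destination_concurrency_limit = 50
smtp-bursty_destination_concurrency_failed_cohort_limit = 10

# /etc/postfix/transport (example.com stands in for the affected domain)
example.com  smtp-bursty:
```

After editing, the hash table has to be rebuilt with "postmap
/etc/postfix/transport" and the configuration reloaded with "postfix reload".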


Unless this is the *only* destination, you probably shouldn't adjust 
the queue run parameters.




  -- Noel Jones






Re: Short term DNS issue causing Postfix to queue messages

2022-10-19 Thread Wietse Venema
When all deliveries to a site fail (a cohort of delivery agent
processes reports the destination is unavailable) the Postfix
scheduler puts the destination on a temporary 'dead destination'
list, to avoid spending resources on that destination.

Of course this design is not optimized for bursts of DNS outages
and DNS resource records with short TTL values. Such sites will
need special configuration.

To eliminate this dead list feature selectively, route mail for
this site to a dedicated SMTP transport:

/etc/postfix/main.cf:
    transport_maps = hash:/etc/postfix/transport
    smtp-without-deadlist_destination_concurrency_failed_cohort_limit = 0
    smtp-without-deadlist_destination_concurrency_negative_feedback = 0

/etc/postfix/transport:
    example.com smtp-without-deadlist:

/etc/postfix/master.cf:
    smtp-without-deadlist  unix  -   -   n   -   -   smtp

Below is some theory from the postconf(5) manpages.

Wietse

default_destination_concurrency_failed_cohort_limit (default: 1)
   How many pseudo-cohorts must suffer connection or handshake failure
   before a specific destination is considered unavailable (and further
   delivery is suspended). Specify zero to disable this feature. A
   destination's pseudo-cohort failure count is reset each time a
   delivery completes without connection or handshake failure for that
   specific destination.

   A pseudo-cohort is the number of deliveries equal to a destination's
   delivery concurrency.

   Use transport_destination_concurrency_failed_cohort_limit to specify a
   transport-specific override, where transport is the master.cf name of
   the message delivery transport.

   This feature is available in Postfix 2.5. The default setting is
   compatible with earlier Postfix versions.

default_destination_concurrency_negative_feedback (default: 1)
...
   As of Postfix version 2.5, negative feedback cannot reduce delivery
   concurrency to zero. Instead, a destination is marked dead (further
   delivery suspended) after the failed pseudo-cohort count reaches
   $default_destination_concurrency_failed_cohort_limit (or
   $transport_destination_concurrency_failed_cohort_limit). To make the
   scheduler completely immune to connection or handshake failures,
   specify a zero feedback value and a zero failed pseudo-cohort limit.