Short term DNS issue causing Postfix to queue messages
We've got a pool of servers running Postfix. Each server is running BIND to cache DNS queries. We are running into an issue where DNS queries are intermittently failing (the cause is beyond the scope of this discussion). When this happens multiple times consecutively, Postfix starts queueing ALL mail that would go to the affected destination for exactly 5 minutes.

For example, BIND, with query logging turned on, shows several of these logs:

Oct 19 11:53:12 hkglppfpool4 named[206415]: client @0x7f32b806b440 127.0.0.1#53827 (cluster9out.us.messagelabs.com): query failed (SERVFAIL) for cluster9out.us.messagelabs.com/IN/A at ../../../bin/named/query.c:8580

At the same time Postfix logs:

Oct 19 11:53:12 hkglppfpool4 postfix/smtp[131030]: 4MspyQ3Fm6z511Sx: to=, relay=none, delay=10, delays=0.14/0/10/0, dsn=4.4.3, status=deferred (Host or domain name not found. Name service error for name=cluster9out.us.messagelabs.com type=A: Host not found, try again)

When this happens Postfix starts deferring ALL mail that should be delivered to cluster9out.us.messagelabs.com for exactly 300 seconds. The named query logs show no queries for this hostname during those 5 minutes; Postfix is not even trying the lookup any more. After the 5 minutes are up, new messages routing to cluster9out.us.messagelabs.com are delivered without being deferred and the queued messages begin to go out.

Testing shows that the DNS issue is very short term, lasting for 1 second or so. However, the pool of servers can handle a large number of messages in a short time period. This combination of events amplifies the short-term DNS issue into messages queueing for 5 minutes. We've seen the queues get up over 1000 messages before the 5 minutes are up.

The above is just one example. We're seeing these delivery delays going to several different hosts.

The correct solution is to fix the underlying DNS issue. However, until then we'd like to mitigate the consequences.
Are there configuration options that will
a) adjust the number of DNS failures before postfix starts deferring the messages
b) adjust the timeout before postfix stops queueing messages

Thanks,
Eric Wilkison
Re: Short term DNS issue causing Postfix to queue messages
On Wed, Oct 19, 2022 at 4:12 PM Eric Wilkison wrote:
> Are there configuration options that will
> a) adjust the number of DNS failures before postfix starts deferring the
> messages
> b) adjust the timeout before postfix stops queueing messages

Take a look at minimal_backoff_time and queue_run_delay; they may help ameliorate the issue. 300 seconds is the default for many of the main.cf parameters; searching http://www.postfix.org/postconf.5.html for "300s" will find them all for you.

Of course, as you mentioned, fixing the DNS is the best course.
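
A minimal sketch of the suggestion above, assuming the goal is to shorten the 5-minute window. The 60s values are illustrative assumptions, not figures from this thread. postconf(5) documents 300s as the default for both parameters, says queue_run_delay should be less than or equal to minimal_backoff_time, and notes that minimal_backoff_time also limits how long an unreachable destination is kept in the scheduler's in-memory status cache.

```
# /etc/postfix/main.cf
# Illustrative values only (assumed); the documented default for both is 300s.

# Minimum time between delivery attempts for a deferred message.
minimal_backoff_time = 60s

# How often the queue manager scans the deferred queue.
# Keep this less than or equal to minimal_backoff_time.
queue_run_delay = 60s
```

Set this way, the timers apply to every destination, not only the one with the DNS problem, so deferred mail for genuinely unreachable sites is also retried more often. Run "postfix reload" after editing main.cf.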
Re: Short term DNS issue causing Postfix to queue messages
On 10/19/2022 3:10 PM, Eric Wilkison wrote:
> Are there configuration options that will
> a) adjust the number of DNS failures before postfix starts deferring the messages
> b) adjust the timeout before postfix stops queueing messages
>
> Thanks, Eric Wilkison

Please see
http://www.postfix.org/QSHAPE_README.html
http://www.postfix.org/TUNING_README.html#hammer

with particular attention to the section:
http://www.postfix.org/QSHAPE_README.html#backlog

Setting up a custom transport for that destination with a high destination concurrency and a high failed cohort setting will likely reduce the pain of these temporary errors. Unless this is the *only* destination, you probably shouldn't adjust the queue run parameters.

-- Noel Jones
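
One way to read the suggestion above, sketched as configuration. The transport name "slowdns", the placeholder domain example.com, and both numeric values are assumptions for illustration; only the parameter name patterns (transport_destination_concurrency_limit and transport_destination_concurrency_failed_cohort_limit, where transport is the master.cf service name) come from postconf(5).

```
# /etc/postfix/main.cf
transport_maps = hash:/etc/postfix/transport
# "slowdns" is a hypothetical transport name; both values are illustrative.
slowdns_destination_concurrency_limit = 50
slowdns_destination_concurrency_failed_cohort_limit = 10

# /etc/postfix/transport
# example.com stands in for the affected recipient domain.
example.com    slowdns:

# /etc/postfix/master.cf
# A second instance of the smtp(8) client under the hypothetical name.
slowdns   unix  -       -       n       -       -       smtp
```

Because the transport table is a hash: map, run "postmap /etc/postfix/transport" after editing it, then "postfix reload". A higher failed cohort limit means more consecutive failures are tolerated before the destination is suspended; it does not disable the suspension entirely.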
Re: Short term DNS issue causing Postfix to queue messages
When all deliveries to a site fail (a cohort of delivery agent processes reports the destination is unavailable) the Postfix scheduler puts the destination on a temporary 'dead destination' list, to avoid spending resources on that destination.

Of course this design is not optimized for bursts of DNS outages and DNS resource records with short TTL values. Such sites will need special configuration.

To eliminate this dead list feature selectively, route mail for this site to a dedicated SMTP transport:

/etc/postfix/main.cf:
    transport_maps = hash:/etc/postfix/transport
    smtp-without-deadlist_destination_concurrency_failed_cohort_limit = 0
    smtp-without-deadlist_destination_concurrency_negative_feedback = 0

/etc/postfix/transport:
    example.com    smtp-without-deadlist:

/etc/postfix/master.cf:
    smtp-without-deadlist unix - - n - - smtp

Below is some theory from the postconf(5) manpages.

    Wietse

default_destination_concurrency_failed_cohort_limit (default: 1)
    How many pseudo-cohorts must suffer connection or handshake failure before a specific destination is considered unavailable (and further delivery is suspended). Specify zero to disable this feature. A destination's pseudo-cohort failure count is reset each time a delivery completes without connection or handshake failure for that specific destination.

    A pseudo-cohort is the number of deliveries equal to a destination's delivery concurrency.

    Use transport_destination_concurrency_failed_cohort_limit to specify a transport-specific override, where transport is the master.cf name of the message delivery transport.

    This feature is available in Postfix 2.5. The default setting is compatible with earlier Postfix versions.

default_destination_concurrency_negative_feedback (default: 1)
    ...

    As of Postfix version 2.5, negative feedback cannot reduce delivery concurrency to zero. Instead, a destination is marked dead (further delivery suspended) after the failed pseudo-cohort count reaches $default_destination_concurrency_failed_cohort_limit (or $transport_destination_concurrency_failed_cohort_limit). To make the scheduler completely immune to connection or handshake failures, specify a zero feedback value and a zero failed pseudo-cohort limit.