On 2023-01-23 at 12:06:34 UTC-0500 (Mon, 23 Jan 2023 17:06:34 +0000)
White, Daniel E. (GSFC-770.0)[AEGIS] <daniel.e.wh...@nasa.gov>
is rumored to have said:
There was no outage.
The queue filled faster than the delivery processes could drain it.
I do not know which limit to increase to accommodate such bursts of
traffic.
I did find 27 instances of this block of info in the logs:
postfix/qmgr[PID]: QUEUE_ID: from=<sender>, size=1370, nrcpt=1 (queue active)
postfix/qmgr[PID]: warning: mail for [127.0.0.1]:10024 is using up NUMBER of NUMBER active queue entries
postfix/qmgr[PID]: warning: you may need to reduce smtp-amavis connect and helo timeouts
postfix/qmgr[PID]: warning: so that Postfix quickly skips unavailable hosts
postfix/qmgr[PID]: warning: you may need to increase the main.cf minimal_backoff_time and maximal_backoff_time
postfix/qmgr[PID]: warning: so that Postfix wastes less time on undeliverable mail
postfix/qmgr[PID]: warning: you may need to increase the master.cf smtp-amavis process limit
postfix/qmgr[PID]: warning: please avoid flushing the whole queue when you have
postfix/qmgr[PID]: warning: lots of deferred mail, that is bad for performance
postfix/qmgr[PID]: warning: to turn off these warnings specify: qmgr_clog_warn_time = 0
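The warnings above map onto a handful of main.cf parameters. A sketch of what that looks like in config form (the values shown are purely illustrative, not recommendations for this site):

```
# main.cf -- illustrative values only; tune for your own traffic.
# Back off longer between retries so qmgr wastes less time on
# undeliverable mail (per the second pair of warnings above):
minimal_backoff_time = 300s
maximal_backoff_time = 4000s
# Set to 0 to suppress the qmgr clog warnings entirely:
# qmgr_clog_warn_time = 0
```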
From postconf:
minimal_backoff_time = 300s
maximal_backoff_time = 4000s
smtp_helo_timeout = 300s
But where do I find the smtp-amavis connect timeout?
Is it the milter_connect_timeout?
No, it appears that you are using Amavisd as an SMTP proxy between two
Postfix smtpd processes, not as a milter. Presumably 'smtp-amavis' is the
post-proxy smtpd, which uses the standard smtpd_* settings (from main.cf
or defaults) unless you override those settings in master.cf. It is
probably more important to make sure that the smtpd instance on the
output side of the proxy has a process limit equal to the one handling
the external connection, or else that will be a bottleneck and you can
get those warnings from qmgr.
I don't believe that setting the process limit on the outbound-side
smtpd service higher than on the inbound side provides any benefit, but
Viktor or Wietse will likely correct me if I'm wrong.
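The matched-process-limit advice can be sketched as a master.cf fragment. The service names, ports (10024/10025), and -o overrides below are the conventional amavisd-new proxy setup, assumed for illustration rather than taken from this thread:

```
# master.cf -- sketch of a conventional Postfix/amavisd proxy path.
# Pre-filter client: hand mail to amavisd on 10024, at most 20 at once.
smtp-amavis unix  -    -    n    -    20    smtp
    -o smtp_data_done_timeout=1200
    -o smtp_send_xforward_command=yes
# Post-filter listener: amavisd reinjects on 10025. Keep its maxproc
# equal to the smtp-amavis limit so reinjection is not the bottleneck.
127.0.0.1:10025 inet  n    -    n    -    20    smtpd
    -o content_filter=
    -o smtpd_recipient_restrictions=permit_mynetworks,reject
```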
From: Wietse Venema <wie...@porcupine.org>
Date: Monday, January 23, 2023 at 11:28
To: Daniel White <daniel.e.wh...@nasa.gov>
Cc: Postfix users <postfix-users@postfix.org>
Subject: [EXTERNAL] Re: Mail queue took 3 hours to recover from a flood. Suggestions?
White, Daniel E. (GSFC-770.0)[AEGIS]:
Around 12000 messages.
The queue went from ~3000 to over 12000 in about 30 minutes and then
took 3 hours to grind through all of them.
I am still trying to determine if this was an accident or not.
The source claims it was not intentionally malicious.
Some postconf values:
default_destination_concurrency_failed_cohort_limit = 1
default_destination_concurrency_limit = 20
default_process_limit = 100
I did not see anything at
http://www.postfix.org/TUNING_README.html
that looked like it would help, but we are operating with a skeleton
crew, and I do not have the luxury of digging into the details.
When a message is not delivered for 30min because of an outage,
it will take 30min before Postfix tries to deliver that message
again. So it will take at least an hour to clear the queue, more
depending on how much additional mail was queued in the meantime.
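Wietse's arithmetic can be sketched numerically. As a simplification (the real queue manager's timing also depends on queue_run_delay and deferred-queue scan timing), assume Postfix doubles a deferred message's backoff from minimal_backoff_time up to a cap of maximal_backoff_time:

```python
# Rough model of Postfix deferred-mail retry timing. This is a
# simplification: actual retries also depend on queue_run_delay and
# when the deferred queue is scanned.

def retry_offsets(minimal=300, maximal=4000, horizon=7200):
    """Seconds after the first failure at which retries would occur."""
    offsets, t, backoff = [], 0, minimal
    while t + backoff <= horizon:
        t += backoff
        offsets.append(t)
        backoff = min(backoff * 2, maximal)  # double, capped at maximal
    return offsets

# With this poster's values (300s / 4000s) over a two-hour window:
print(retry_offsets())  # [300, 900, 2100, 4500]
```

Under this model a message deferred early in the flood gets only a handful of retries in the first two hours, which is consistent with a 12000-message queue taking hours to drain.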
Without further details there can be no useful help.
Wietse
--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire