Hi all,

     My company sends a bunch of emails to opt-in users to the tune of
about 27 million emails a day, and about 47 million on Saturday (when
we send both daily and weekly emails to those that have just requested
weekly alerts).  This load is spread out over about 30 VMs, each of
which maps to a different public IP.  We had a problem about three weeks ago
where we seemed to universally fall over the I/O threshold on the
disks connected to the VMs, causing the queues to quickly back up and
the systems to run out of space.  Long story short(er), the systems are
currently using ramdisk queues while we buy faster drives.  These
queues are significantly smaller (1G total) since each VM only has 2G,
and I've noticed that at busy times, sends to Gmail in particular get
backed up, causing delays of 10-30 minutes.  This was likely happening
before, but with the larger queue sizes, I didn't notice.  The systems
are largely idle, and with in-memory queues, disk I/O isn't an issue.
We split out Gmail into its own transport ages ago, as Gmail
addresses have been our #1 destination for years now (about 10M a day
today).  Even at peak, each VM is sending under 100K emails an hour
(bottlenecks on the sending program), and Google happily accepts the
email from us as fast as we send it without deferral, so I doubt it's
a delivery problem or a Postfix or system capacity problem.  I tried
turning some knobs with concurrency, but never saw postfix spin up
smtp processes to max out those settings.  I'm quite certain that the
problem here is my config, but I'm not sure how to proceed.
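
For a sense of scale, here is a back-of-the-envelope sketch of the per-VM load using the figures above. It assumes the load is spread evenly across all 30 VMs, which is a simplification on my part, and the variable names are just for illustration.

```python
# Rough per-VM load from the figures quoted in this post.
# ASSUMPTION: traffic is spread evenly across the VMs.
DAILY_TOTAL = 27_000_000          # emails per normal day
SATURDAY_TOTAL = 47_000_000       # emails on Saturday
GMAIL_DAILY = 10_000_000          # emails per day to Gmail addresses
VMS = 30                          # approximate number of sending VMs
PEAK_PER_VM_PER_HOUR = 100_000    # stated upper bound at peak


def per_vm(total, vms=VMS):
    """Average number of emails one VM handles out of a fleet-wide total."""
    return total / vms


daily_per_vm = per_vm(DAILY_TOTAL)            # 900,000 per VM per day
saturday_per_vm = per_vm(SATURDAY_TOTAL)      # a bit under 1.6M per VM
gmail_per_vm = per_vm(GMAIL_DAILY)            # about a third of a million
peak_per_second = PEAK_PER_VM_PER_HOUR / 3600 # upper bound, about 27.8/s
gmail_share = GMAIL_DAILY / DAILY_TOTAL       # fraction of mail going to Gmail

if __name__ == "__main__":
    print(f"per VM per day:      {daily_per_vm:,.0f}")
    print(f"per VM on Saturday:  {saturday_per_vm:,.0f}")
    print(f"Gmail per VM per day: {gmail_per_vm:,.0f}")
    print(f"peak per VM per sec: {peak_per_second:.1f}")
    print(f"Gmail share:         {gmail_share:.0%}")
```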

Here's the postconf from one of the sender servers (IPs cleaned a bit
so the company doesn't get grumpy):

alias_database = hash:/etc/aliases
alias_maps = hash:/etc/aliases
allow_min_user = yes
bounce_queue_lifetime = 4d
command_directory = /usr/sbin
config_directory = /etc/postfix/mail88
daemon_directory = /usr/libexec/postfix
data_directory = /var/lib/postfix/mail88
debug_peer_level = 2
default_destination_concurrency_limit = 20
html_directory = no
in_flow_delay = 0
inet_interfaces = 10.1.1.XXX
local_header_rewrite_clients = permit_mynetworks
mail_owner = postfix
mailbox_delivery_lock = fcntl
mailq_path = /usr/bin/mailq.postfix
manpage_directory = /usr/share/man
masquerade_domains = example.com
maximal_backoff_time = 1800
maximal_queue_lifetime = 4d
message_size_limit = 36700160
minimal_backoff_time = 600
mydestination = $myhostname, localhost.$mydomain, localhost, mail88, mail88.example.com
mydomain = example.com
myhostname = mail88.example.com
mynetworks = 10.0.0.0/8, 172.18.18.0/23, XXX.XXX.XXX.XXX/25
myorigin = $mydomain
newaliases_path = /usr/bin/newaliases.postfix
queue_directory = /var/spool/postfix/mail88
queue_run_delay = 600
readme_directory = /usr/share/doc/postfix-2.3.3/README_FILES
sample_directory = /usr/share/doc/postfix-2.3.3/samples
sendmail_path = /usr/sbin/sendmail.postfix
setgid_group = postdrop
smtp_connection_cache_destinations = bellsouth.net
smtpd_recipient_restrictions = check_recipient_mx_access cidr:$config_directory/bogus_mx,    check_recipient_access hash:$config_directory/recipient_access,    permit_mynetworks, reject_unauth_destination
smtpd_timeout = 600s
syslog_name = postfix88
transport_maps = hash:$config_directory/transport
unknown_local_recipient_reject_code = 550
virtual_alias_maps = hash:$config_directory/virtual

We have a few transports set up in master.cf:

smtp      inet  n       -       n       -       200     smtpd
lowconn   unix  -       -       n       -       -       smtp
highconn  unix  -       -       n       -       -       smtp
comcast   unix  -       -       n       -       2       smtp
bellsouth unix  -       -       n       -       1       smtp
   -o smtp_connection_cache_time_limit=7
att       unix  -       -       n       -       2       smtp
   -o smtp_connection_cache_time_limit=5
yahoo     unix  -       -       n       -       25      smtp
   -o smtp_connection_cache_time_limit=15
   -o smtp_destination_concurrency_limit=30
gmail     unix  -       -       n       -       50      smtp
   -o smtp_connection_cache_time_limit=15
   -o smtp_destination_concurrency_limit=100

and the transport map looks something like this for the various transports:

# Giving gmail their own transport
gmail.com       gmail:

Logs for the delayed emails are rather boring, looking something like this:

Apr 13 23:52:16 xxx-mail88 postfix88/smtpd[29448]: 56F77E7784AD:
client=sendingserver.example.net[10.1.1.XXX]
Apr 13 23:52:16 xxx-mail88 postfix88/cleanup[12513]: 56F77E7784AD:
message-id=<214e220ccd04ac46.1428987136273.SendAlerts.tomcat@sendingserver>
Apr 13 23:52:16 xxx-mail88 postfix88/qmgr[22994]: 56F77E7784AD:
from=<al...@example.com>, size=73004, nrcpt=1 (queue active)
Apr 14 00:00:00 xxx-mail88 postfix88/smtp[13542]: 56F77E7784AD:
to=<persongettingtheal...@gmail.com>,
relay=gmail-smtp-in.l.google.com[173.194.72.27]:25, conn_use=19,
delay=465, delays=0.1/463/0.29/1.6, dsn=2.0.0, status=sent (250 2.0.0
OK 1428987600 no10si18948696pdb.63 - gsmtp)
Apr 14 00:00:00 xxx-mail88 postfix88/qmgr[22994]: 56F77E7784AD: removed
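
To show where the time goes in that delivery, here is a small sketch that splits the delays= field into its four parts. As I understand the Postfix logging format, the fields are a/b/c/d: time before the queue manager, time in the queue manager, connection setup, and message transmission. The function name and the dictionary keys are my own labels, not anything Postfix provides.

```python
import re

# The smtp delivery line from the log excerpt above, joined onto one line.
LOG_LINE = (
    "Apr 14 00:00:00 xxx-mail88 postfix88/smtp[13542]: 56F77E7784AD: "
    "to=<persongettingtheal...@gmail.com>, "
    "relay=gmail-smtp-in.l.google.com[173.194.72.27]:25, conn_use=19, "
    "delay=465, delays=0.1/463/0.29/1.6, dsn=2.0.0, status=sent "
    "(250 2.0.0 OK 1428987600 no10si18948696pdb.63 - gsmtp)"
)

# Matches "delays=a/b/c/d" where each part is a decimal number of seconds.
DELAYS_RE = re.compile(r"delays=([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+)")


def parse_delays(line):
    """Return the four delay components in seconds, or None if absent.

    Key names are illustrative labels for the a/b/c/d fields.
    """
    match = DELAYS_RE.search(line)
    if match is None:
        return None
    a, b, c, d = (float(part) for part in match.groups())
    return {
        "before_qmgr": a,   # time before the queue manager saw the message
        "in_qmgr": b,       # time spent waiting in the queue manager
        "conn_setup": c,    # connection setup (DNS, HELO, TLS)
        "transmission": d,  # message transmission
    }


if __name__ == "__main__":
    for name, seconds in parse_delays(LOG_LINE).items():
        print(f"{name:>12}: {seconds} s")
```

For this message, nearly all of the delay is the second field, time spent in the queue manager, while the actual connection and transmission took about two seconds.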

Usually I'm seeing about 20-25 of the gmail smtp processes in the
process list at peak, but maybe 10000 gmail emails in the active queue
as reported by qshape.
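
A rough sanity check on those numbers, not a diagnosis: if each smtp process delivers one message at a time, and a delivery takes roughly the connection plus transmission time from the single log sample above (0.29 + 1.6 seconds), the time to drain the active queue can be estimated as below. The serial-delivery model, the use of one sample as the per-message time, and the function name are all my own simplifications.

```python
# Estimate how long N queued messages take to drain through P processes.
# ASSUMPTIONS: each process handles one message at a time, and every
# delivery takes the same time as the single log sample in this post.
ACTIVE_QUEUE = 10_000               # approximate gmail messages per qshape
SECONDS_PER_DELIVERY = 0.29 + 1.6   # conn setup + transmission, one sample


def drain_minutes(messages, processes, seconds_per_delivery):
    """Minutes to deliver `messages` with `processes` parallel workers."""
    if processes <= 0:
        raise ValueError("processes must be positive")
    return messages * seconds_per_delivery / processes / 60


if __name__ == "__main__":
    # 20 and 25 are what I see in the process list; 50 is the gmail
    # process limit configured in master.cf.
    for procs in (20, 25, 50):
        minutes = drain_minutes(ACTIVE_QUEUE, procs, SECONDS_PER_DELIVERY)
        print(f"{procs:>3} processes: about {minutes:.1f} minutes")
```

With 20 to 25 processes this comes out in the same range as the 10-30 minute delays I'm seeing, so the limited number of smtp processes looks like a plausible place for the backlog to come from.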

I've read through the tuning documents a few times now and I'm not
sure what I'm missing here, but I'm confident that the problem lies
with the guy typing this email, so I'm happy to provide any
information needed to get over my lack of understanding.

Thanks!
Andrew
