Hi all,

My company sends email to opt-in users: about 27 million messages a day, and about 47 million on Saturday (when we send both the daily emails and the weekly emails for those who have requested only weekly alerts). This load is spread over about 30 VMs, each mapped to a different public IP.

About three weeks ago we had a problem where we seemed to universally exceed the I/O threshold on the disks connected to the VMs, causing the queues to back up quickly and the systems to run out of space. Long story short(er), the systems are currently using ramdisk queues while we buy faster drives. These queues are significantly smaller (1G total) since each VM only has 2G, and I've noticed that at busy times, sends to Gmail in particular get backed up, causing delays of 10-30 minutes. This was likely happening before, but with the larger queue sizes I didn't notice. The systems are largely idle, and with in-memory queues, disk I/O isn't an issue.

We split Gmail out into its own transport ages ago, as Gmail addresses have been our #1 destination for years now (about 10M a day today). Even at peak, each VM is sending under 100K emails an hour (the bottleneck is the sending program), and Google happily accepts the email from us as fast as we send it, without deferrals, so I doubt it's a delivery problem, or a Postfix or system capacity problem. I tried turning some knobs for concurrency, but never saw Postfix spin up enough smtp processes to max out those settings. I'm quite certain that the problem here is my config, but I'm not sure how to proceed.
Here's the postconf from one of the sender servers (IPs cleaned a bit so the company doesn't get grumpy):

    alias_database = hash:/etc/aliases
    alias_maps = hash:/etc/aliases
    allow_min_user = yes
    bounce_queue_lifetime = 4d
    command_directory = /usr/sbin
    config_directory = /etc/postfix/mail88
    daemon_directory = /usr/libexec/postfix
    data_directory = /var/lib/postfix/mail88
    debug_peer_level = 2
    default_destination_concurrency_limit = 20
    html_directory = no
    in_flow_delay = 0
    inet_interfaces = 10.1.1.XXX
    local_header_rewrite_clients = permit_mynetworks
    mail_owner = postfix
    mailbox_delivery_lock = fcntl
    mailq_path = /usr/bin/mailq.postfix
    manpage_directory = /usr/share/man
    masquerade_domains = example.com
    maximal_backoff_time = 1800
    maximal_queue_lifetime = 4d
    message_size_limit = 36700160
    minimal_backoff_time = 600
    mydestination = $myhostname, localhost.$mydomain, localhost, mail88, mail88.example.com
    mydomain = example.com
    myhostname = mail88.example.com
    mynetworks = 10.0.0.0/8, 172.18.18.0/23, XXX.XXX.XXX.XXX/25
    myorigin = $mydomain
    newaliases_path = /usr/bin/newaliases.postfix
    queue_directory = /var/spool/postfix/mail88
    queue_run_delay = 600
    readme_directory = /usr/share/doc/postfix-2.3.3/README_FILES
    sample_directory = /usr/share/doc/postfix-2.3.3/samples
    sendmail_path = /usr/sbin/sendmail.postfix
    setgid_group = postdrop
    smtp_connection_cache_destinations = bellsouth.net
    smtpd_recipient_restrictions = check_recipient_mx_access cidr:$config_directory/bogus_mx, check_recipient_access hash:$config_directory/recipient_access, permit_mynetworks, reject_unauth_destination
    smtpd_timeout = 600s
    syslog_name = postfix88
    transport_maps = hash:$config_directory/transport
    unknown_local_recipient_reject_code = 550
    virtual_alias_maps = hash:$config_directory/virtual

We have a few transports set up in master.cf:

    smtp       inet  n  -  n  -  200  smtpd
    lowconn    unix  -  -  n  -  -    smtp
    highconn   unix  -  -  n  -  -    smtp
    comcast    unix  -  -  n  -  2    smtp
    bellsouth  unix  -  -  n  -  1    smtp
        -o smtp_connection_cache_time_limit=7
    att        unix  -  -  n  -  2    smtp
        -o smtp_connection_cache_time_limit=5
    yahoo      unix  -  -  n  -  25   smtp
        -o smtp_connection_cache_time_limit=15
        -o smtp_destination_concurrency_limit=30
    gmail      unix  -  -  n  -  50   smtp
        -o smtp_connection_cache_time_limit=15
        -o smtp_destination_concurrency_limit=100

and the transport map looks something like this for the various transports:

    # Giving gmail their own transport
    gmail.com    gmail:

Logs for the delayed emails are rather boring, looking something like this:

    Apr 13 23:52:16 xxx-mail88 postfix88/smtpd[29448]: 56F77E7784AD: client=sendingserver.example.net[10.1.1.XXX]
    Apr 13 23:52:16 xxx-mail88 postfix88/cleanup[12513]: 56F77E7784AD: message-id=<214e220ccd04ac46.1428987136273.SendAlerts.tomcat@sendingserver>
    Apr 13 23:52:16 xxx-mail88 postfix88/qmgr[22994]: 56F77E7784AD: from=<al...@example.com>, size=73004, nrcpt=1 (queue active)
    Apr 14 00:00:00 xxx-mail88 postfix88/smtp[13542]: 56F77E7784AD: to=<persongettingtheal...@gmail.com>, relay=gmail-smtp-in.l.google.com[173.194.72.27]:25, conn_use=19, delay=465, delays=0.1/463/0.29/1.6, dsn=2.0.0, status=sent (250 2.0.0 OK 1428987600 no10si18948696pdb.63 - gsmtp)
    Apr 14 00:00:00 xxx-mail88 postfix88/qmgr[22994]: 56F77E7784AD: removed

At peak I usually see about 20-25 of the gmail smtp processes in the process list, but maybe 10000 gmail emails in the active queue as reported by qshape.

I've read through the tuning documents a few times now and I'm not sure what I'm missing here, but I'm confident that the problem lies with the guy typing this email, so I'm happy to provide any information needed to get past my lack of understanding.

Thanks!
Andrew
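
P.S. In case it helps anyone reading the smtp log line above: as far as I understand the Postfix logging format (2.3 and later), the four numbers in delays=a/b/c/d are time before the queue manager, time in the queue manager, connection setup time, and message transmission time. Here is a minimal Python sketch that splits that field out of a log line and names the slowest stage. The stage labels and the function names parse_delays and slowest_stage are my own, purely for illustration, and the sample line is the delayed delivery from the logs above with the leading fields trimmed.

```python
import re

# Labels for the four delays= components. These names are made up for this
# sketch; the order follows the a/b/c/d layout Postfix logs (2.3 and later).
STAGES = ("before_qmgr", "in_qmgr", "conn_setup", "transmission")

# Matches "delays=a/b/c/d". It does not match the separate "delay=" field,
# because that one has no trailing "s" before the equals sign.
DELAYS_RE = re.compile(r"delays=([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+)")


def parse_delays(log_line):
    """Return a dict of stage label -> seconds, or None if there is no delays= field."""
    match = DELAYS_RE.search(log_line)
    if match is None:
        return None
    return dict(zip(STAGES, (float(value) for value in match.groups())))


def slowest_stage(log_line):
    """Return the label of the stage with the largest delay, or None."""
    delays = parse_delays(log_line)
    if delays is None:
        return None
    return max(delays, key=delays.get)


# Tail of the delayed Gmail delivery shown in the logs above.
line = "conn_use=19, delay=465, delays=0.1/463/0.29/1.6, dsn=2.0.0, status=sent"

print(parse_delays(line))
print(slowest_stage(line))
```

For the sample line, nearly all of the 465 seconds falls in the second component (463 seconds in the queue manager), while connection setup and transmission together take under two seconds, which matches what I said about Google accepting mail as fast as we hand it over.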