Viktor,

>>>Fix your logging, then measure again. A concurrency of 20 may be sufficient
>>>when the log level is sane.

Monday night I fixed my logging as below.

> smtp_tls_loglevel = 1
> smtpd_tls_loglevel = 1

Here is the throughput per 30 minutes from 8:00 through 18:00 yesterday and today.

April 29 (Tuesday):

08:00:00 - 08:29:59: 10961
08:30:00 - 08:59:59: 13615
09:00:00 - 09:29:59: 14595
09:30:00 - 09:59:59: 8773
10:00:00 - 10:29:59: 14430
10:30:00 - 10:59:59: 10008
11:00:00 - 11:29:59: 15775
11:30:00 - 11:59:59: 8831
12:00:00 - 12:29:59: 10278
12:30:00 - 12:59:59: 7385
13:00:00 - 13:29:59: 10667
13:30:00 - 13:59:59: 11157
14:00:00 - 14:29:59: 14754
14:30:00 - 14:59:59: 16204
15:00:00 - 15:29:59: 14562
15:30:00 - 15:59:59: 8669
16:00:00 - 16:29:59: 12502
16:30:00 - 16:59:59: 5390
17:00:00 - 17:29:59: 10168
17:30:00 - 17:59:59: 11201
18:00:00 - 18:29:59: 11841
18:30:00 - 18:59:59: 5495

April 30 (Wednesday):

08:00:00 - 08:29:59: 12537
08:30:00 - 08:59:59: 6535
09:00:00 - 09:29:59: 10978
09:30:00 - 09:59:59: 9147
10:00:00 - 10:29:59: 18220
10:30:00 - 10:59:59: 12779
11:00:00 - 11:29:59: 12659
11:30:00 - 11:59:59: 8974
12:00:00 - 12:29:59: 13835
12:30:00 - 12:59:59: 14805
13:00:00 - 13:29:59: 16831
13:30:00 - 13:59:59: 7153
14:00:00 - 14:29:59: 11017
14:30:00 - 14:59:59: 10422
15:00:00 - 15:29:59: 15617
15:30:00 - 15:59:59: 11271
16:00:00 - 16:29:59: 11120
16:30:00 - 16:59:59: 7963
17:00:00 - 17:29:59: 7759
17:30:00 - 17:59:59: 4817
18:00:00 - 18:29:59: 5815
18:30:00 - 18:59:59: 3581

>>>Understand and memorize this simple formula:
>>>
>>>    Throughput = Concurrency / Latency

If Concurrency = 20 and Latency = 2.8s, Throughput = 7.14286/second, which is equal to 12,857 per 30 minutes. Is this a threshold? If real throughput is greater than roughly this number, delays will obviously occur in peak hours, as below on Monday, right?

April 28 (Monday):

10:30:00 - 10:59:59: 10928    delays (>=120s) were consecutive - 121s ~ 2402s
11:00:00 - 11:29:59: 15597    delays (>=120s) were consecutive - 648s ~ 2380s
11:30:00 - 11:59:59: 3821     delays (>=120s) were consecutive - 514s ~ 813s

If we increase default_destination_concurrency_limit to 30, the threshold of throughput per 30 minutes will be 19,285.71 (30 / 2.8 * 1800 = 19,285.71), which is greater than the throughput yesterday and today. Does this avoid obvious delays?

Also, I read Postfix Performance Tuning at http://www.postfix.org/TUNING_README.html about "Tuning the number of simultaneous deliveries" and "Tuning the number of recipients per delivery".

* For a high volume destination, it seems we are able to increase default_destination_concurrency_limit (20 -> 30?) and lower smtp_connect_timeout (30s -> 5s);
* For a high volume destination, it seems we are able to increase default_destination_recipient_limit (50 -> 100?)

And I read Postfix Bottleneck Analysis at http://www.postfix.org/QSHAPE_README.html about "The active queue":

(The only way to reduce congestion is to either reduce the input rate or increase the throughput. Increasing the throughput requires either increasing the concurrency or reducing the latency of deliveries. For high volume sites a key tuning parameter is the number of "smtp" delivery agents allocated to the "smtp" and "relay" transports. High volume sites tend to send to many different destinations, many of which may be down or slow, so a good fraction of the available delivery agents will be blocked waiting for slow sites. Also mail destined across the globe will incur large SMTP command-response latencies, so high message throughput can only be achieved with more concurrent delivery agents.)

and "Example 4: High volume destination backlog", including the following paragraph:

************************************
Postfix version 2.5 and later:

In master.cf set up a dedicated clone of the "smtp" transport for the destination in question. In the example below we will call it "fragile".

In master.cf configure a reasonable process limit for the cloned smtp transport (a number in the 10-20 range is typical).

IMPORTANT!!! In main.cf configure a large per-destination pseudo-cohort failure limit for the cloned smtp transport.

/etc/postfix/main.cf:
    transport_maps = hash:/etc/postfix/transport
    fragile_destination_concurrency_failed_cohort_limit = 100
    fragile_destination_concurrency_limit = 20

/etc/postfix/transport:
    example.com    fragile:

/etc/postfix/master.cf:
    # service type  private unpriv  chroot  wakeup  maxproc command
    fragile   unix  -       -       n       -       20      smtp

See also the documentation for default_destination_concurrency_failed_cohort_limit and default_destination_concurrency_limit
************************************

Can we divide destination domains into three transport groups and create two clones of the "smtp" transport? We tested the configuration on our test server, and it seems new extra smtp processes are created for both the "buckeye" transport and the "famous-ISP" transport, although default smtp processes are still created. Will this change further reduce latency and provide more concurrency so that throughput will increase?

Group1 - buckeyemail.osu.edu uses the "buckeye" transport
Group2 - gmail.com, yahoo.com and hotmail.com use the "famous-ISP" transport
Group3 - other domains use the default "smtp" transport

/etc/postfix/main.cf:
    transport_maps = hash:/etc/postfix/transport
    buckeye_destination_concurrency_failed_cohort_limit = 100
    buckeye_destination_concurrency_limit = 30
    famous-ISP_destination_concurrency_failed_cohort_limit = 100
    famous-ISP_destination_concurrency_limit = 20

/etc/postfix/transport:
    buckeyemail.osu.edu    buckeye:mail.us.messaging.microsoft.com
    gmail.com              famous-ISP:mail.us.messaging.microsoft.com
    yahoo.com              famous-ISP:mail.us.messaging.microsoft.com
    hotmail.com            famous-ISP:mail.us.messaging.microsoft.com

/etc/postfix/master.cf:
    # service type  private unpriv  chroot  wakeup  maxproc command
    buckeye    unix  -       -       n       -       30      smtp
        -o smtp_connect_timeout=5
    famous-ISP unix  -       -       n       -       20      smtp

Thanks and good night,
Carl

-----Original Message-----
From: owner-postfix-us...@postfix.org [mailto:owner-postfix-us...@postfix.org] On Behalf Of Viktor Dukhovni
Sent: Monday, April 28, 2014 7:45 PM
To: postfix-users@postfix.org
Subject: Re: Backlog to outsourced email provider

On Mon, Apr 28, 2014 at 11:05:56PM +0000, Xie, Wei wrote:

> header_checks = regexp:/etc/postfix/header_checks
> relayhost = mail.us.messaging.microsoft.com

This is effectively a miniature transport entry:

    relay_transport = relay:mail.us.messaging.microsoft.com
    default_transport = relay:mail.us.messaging.microsoft.com

Don't know whether the vendor intends for you to do MX lookups here or not (you're doing MX lookups). The MX record just returns the original hostname.

    $ dig +noall +ans -t mx mail.us.messaging.microsoft.com
    mail.us.messaging.microsoft.com. IN MX 10 mail.us.messaging.microsoft.com.
    $ dig +noall +ans -t a mail.us.messaging.microsoft.com
    mail.us.messaging.microsoft.com. IN A 216.32.181.178
    mail.us.messaging.microsoft.com. IN A 216.32.180.22

> smtp_tls_CAfile = /etc/postfix/service_certs/osu_ues/DigiCertCA.crt
> smtp_tls_loglevel = 2
> smtpd_tls_loglevel = 2

You're killing your syslog daemon with debug logging. Why is the TLS loglevel set to 2? Have you looked at your logs? They are full of debugging noise and likely severely limit performance. For normal operation set the log level to 1. Also make sure your syslogd is not doing synchronous logging of each log entry.

> smtp_tls_note_starttls_offer = yes

Futile, given:

> smtp_tls_security_level = encrypt

> Here are the settings for the following two parameters:
>
> default_destination_concurrency_limit = 20

Fix your logging, then measure again. A concurrency of 20 may be sufficient when the log level is sane.

> smtp_destination_concurrency_limit = $default_destination_concurrency_limit

This is redundant.

> >>Either increase concurrency or reduce latency. Network captures may show
> >>which protocol stage is responsible for most of the delay, even with TLS
> >>one can tell whether the delay is at the beginning or at the end of the
> >>TLS session or just low bandwidth throughout.
>
> We prefer to increase concurrency.

The vendor might limit your concurrency, don't do that quite yet.

> >>How is the relay specified with or without surrounding "[]"?
>
> Without surrounding "[]".
>
> relayhost = mail.us.messaging.microsoft.com

Ask the vendor whether they want you to use MX indirection or not.

> On this RHEL 5.10 server, today 10:30:00 ~ 10:59:59 the output rate of
> email to this domain in the 30 minutes was 10,928.
>
> On other 6 RHEL 6.4 servers, today 10:30:00 ~ 10:59:59 the output rate
> of email to this domain in the 30 minutes were 4,824 ~ 6,564.

You're comparing apples and oranges, the RHEL 6 hosts don't receive nearly enough traffic to be congested, they would perhaps be equally congested under the same load. However, they may have sensibly configured logging with TLS loglevel 1, and/or no synchronous log writes.

> Today 10:00:00 ~ 10:29:59 the output rate of email to this relay in
> the 30 minutes was 9,623.
>
> Today 10:30:00 ~ 10:59:59 the output rate of email to this relay in
> the 30 minutes was 10,928.

That's more like it:

    Throughput * Latency = Concurrency

    10928 / 1800 * 2.8 = 17.0

So with latencies around 2.8 seconds your estimated concurrency is ~17 which is close enough to 20. The problem is either that your syslogd is overwhelmed and too slow or the vendor service is too slow. Fix the first problem first.

> Today 11:00:00 ~ 11:29:59 the output rate of email to this relay in
> the 30 minutes was 15,597.

    15597 / 1800 * 2.8 = 24.3

So the latency number from that one message is likely a bit above average.

Understand and memorize this simple formula:

    Throughput = Concurrency / Latency

Fix your logging settings in main.cf and make sure that you follow the advice at the bottom of:

    http://www.postfix.org/LINUX_README.html

    Syslogd performance

    LINUX syslogd uses synchronous writes by default. Because of this, syslogd can actually use more system resources than Postfix. To avoid such badness, disable synchronous mail logfile writes by editing /etc/syslog.conf and by prepending a "-" to the logfile name:

    /etc/syslog.conf:
        mail.*    -/var/log/mail.log

    Send a "kill -HUP" to the syslogd to make the change effective.

--
    Viktor.
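
For anyone who wants to rerun the arithmetic in this thread, here is a minimal Python sketch of the relation Throughput = Concurrency / Latency. It is an illustration only: the function names are made up for this example, the 2.8 second latency and the 30-minute window are the figures used in the thread, and it works with the unrounded per-second rate. It treats latency as a single average, which real traffic will not match exactly.

```python
# Sketch of "Throughput = Concurrency / Latency" as used in this thread.
# Function names are hypothetical; 2.8 s and the 30-minute window come from the thread.

WINDOW_SECONDS = 30 * 60  # one 30-minute measurement slot


def max_throughput(concurrency: int, latency_s: float,
                   window_s: int = WINDOW_SECONDS) -> float:
    """Upper bound on deliveries per window for a given concurrency and average latency."""
    return concurrency / latency_s * window_s


def implied_concurrency(deliveries: int, latency_s: float,
                        window_s: int = WINDOW_SECONDS) -> float:
    """Average number of deliveries in flight needed to sustain the observed rate."""
    return deliveries / window_s * latency_s


if __name__ == "__main__":
    latency = 2.8  # seconds per delivery, assumed constant
    for limit in (20, 30):
        bound = max_throughput(limit, latency)
        print(f"concurrency {limit}: at most {bound:,.0f} deliveries per 30 min")
    for observed in (10928, 15597):
        needed = implied_concurrency(observed, latency)
        print(f"{observed} deliveries per 30 min implies concurrency ~{needed:.1f}")
```

If the implied concurrency for a slot comes out above the configured limit, the average latency in that slot was probably lower than the assumed 2.8 seconds, which is the point made about the 11:00 figure.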