RE: Backlog to outsourced email provider

Xie, Wei Wed, 30 Apr 2014 21:26:29 -0700

Viktor,

>>>Fix your logging, then measure again.  A concurrency of 20 may be sufficient 
>>>when the log level is sane.

Monday night I fixed my logging as below.

> smtp_tls_loglevel = 1
> smtpd_tls_loglevel = 1

Here is the throughput per 30 minutes from 8:00 through 18:59 yesterday and today:

April 29 (Tuesday):
08:00:00 - 08:29:59:  10961 
08:30:00 - 08:59:59:  13615 
09:00:00 - 09:29:59:  14595 
09:30:00 - 09:59:59:  8773  
10:00:00 - 10:29:59:  14430 
10:30:00 - 10:59:59:  10008 
11:00:00 - 11:29:59:  15775 
11:30:00 - 11:59:59:  8831  
12:00:00 - 12:29:59:  10278 
12:30:00 - 12:59:59:  7385  
13:00:00 - 13:29:59:  10667 
13:30:00 - 13:59:59:  11157 
14:00:00 - 14:29:59:  14754 
14:30:00 - 14:59:59:  16204 
15:00:00 - 15:29:59:  14562 
15:30:00 - 15:59:59:  8669  
16:00:00 - 16:29:59:  12502 
16:30:00 - 16:59:59:  5390  
17:00:00 - 17:29:59:  10168 
17:30:00 - 17:59:59:  11201 
18:00:00 - 18:29:59:  11841 
18:30:00 - 18:59:59:  5495  

April 30 (Wednesday):
08:00:00 - 08:29:59:  12537  
08:30:00 - 08:59:59:  6535   
09:00:00 - 09:29:59:  10978  
09:30:00 - 09:59:59:  9147   
10:00:00 - 10:29:59:  18220  
10:30:00 - 10:59:59:  12779  
11:00:00 - 11:29:59:  12659  
11:30:00 - 11:59:59:  8974   
12:00:00 - 12:29:59:  13835  
12:30:00 - 12:59:59:  14805  
13:00:00 - 13:29:59:  16831  
13:30:00 - 13:59:59:  7153   
14:00:00 - 14:29:59:  11017  
14:30:00 - 14:59:59:  10422  
15:00:00 - 15:29:59:  15617  
15:30:00 - 15:59:59:  11271  
16:00:00 - 16:29:59:  11120  
16:30:00 - 16:59:59:  7963   
17:00:00 - 17:29:59:  7759   
17:30:00 - 17:59:59:  4817   
18:00:00 - 18:29:59:  5815   
18:30:00 - 18:59:59:  3581   

>>>Understand and memorize this simple formula:
>>>
>>>     Throughput = Concurrency / Latency

If Concurrency = 20 and Latency = 2.8s, then Throughput = 7.14286/second, which
equals 12,857 per 30 minutes. Is this a threshold? If the real throughput is
roughly at or above this number, delays will obviously occur in peak hours, as
below on Monday, right?

April 28 (Monday):
10:30:00 - 10:59:59:  10928 delays (>=120s) were consecutive - 121s ~ 2402s
11:00:00 - 11:29:59:  15597 delays (>=120s) were consecutive - 648s ~ 2380s
11:30:00 - 11:59:59:  3821  delays (>=120s) were consecutive - 514s ~ 813s

If we increase default_destination_concurrency_limit to 30, the throughput
threshold per 30 minutes will be 19,285.71 (30 / 2.8 * 1800 = 19,285.71), which
is greater than the throughput yesterday and today. Would this avoid the obvious
delays?
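
The threshold arithmetic above can be checked against the two tables with a short script. This is only a sketch: the 2.8 s latency is the single-message figure quoted in this thread, and the function names are made up for illustration.

```python
# Ceiling on deliveries per window from Throughput = Concurrency / Latency.
# Assumption: 2.8 s is the one measured latency quoted in this thread,
# so the ceiling is an estimate, not a measured limit.

WINDOW_S = 1800  # 30-minute reporting window


def ceiling_per_window(concurrency, latency_s, window_s=WINDOW_S):
    """Deliveries per window if every slot stays busy and each takes latency_s."""
    return concurrency / latency_s * window_s


def count_above(counts, limit):
    """Number of intervals whose delivery count exceeds the limit."""
    return sum(1 for c in counts if c > limit)


# Per-30-minute counts copied from the tables above (08:00 through 18:59).
april_29 = [10961, 13615, 14595, 8773, 14430, 10008, 15775, 8831,
            10278, 7385, 10667, 11157, 14754, 16204, 14562, 8669,
            12502, 5390, 10168, 11201, 11841, 5495]
april_30 = [12537, 6535, 10978, 9147, 18220, 12779, 12659, 8974,
            13835, 14805, 16831, 7153, 11017, 10422, 15617, 11271,
            11120, 7963, 7759, 4817, 5815, 3581]

for concurrency in (20, 30):
    limit = ceiling_per_window(concurrency, 2.8)
    print(f"concurrency {concurrency}: ceiling {limit:,.0f} per 30 min; "
          f"intervals above it: Apr 29 = {count_above(april_29, limit)}, "
          f"Apr 30 = {count_above(april_30, limit)}")
```

Several intervals on both days come out above the ceiling computed for a concurrency of 20. That would only be possible if the average latency in those intervals was below 2.8 s, so the threshold is only as good as the latency estimate behind it.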

Also, I read the Postfix Performance Tuning document at
http://www.postfix.org/TUNING_README.html, in particular "Tuning the number of
simultaneous deliveries" and "Tuning the number of recipients per delivery".

    * For a high-volume destination, it seems we can increase
default_destination_concurrency_limit (20 -> 30?) and lower
smtp_connect_timeout (30s -> 5s);
    * For a high-volume destination, it seems we can also increase
default_destination_recipient_limit (50 -> 100?).

I also read the Postfix Bottleneck Analysis document at
http://www.postfix.org/QSHAPE_README.html, section "The active queue":

(The only way to reduce congestion is to either reduce the input rate or 
increase the throughput. Increasing the throughput requires either increasing 
the concurrency or reducing the latency of deliveries.

For high volume sites a key tuning parameter is the number of "smtp" delivery 
agents allocated to the "smtp" and "relay" transports. High volume sites tend 
to send to many different destinations, many of which may be down or slow, so a 
good fraction of the available delivery agents will be blocked waiting for slow 
sites. Also mail destined across the globe will incur large SMTP 
command-response latencies, so high message throughput can only be achieved 
with more concurrent delivery agents. ) 

and "Example 4: High volume destination backlog", including the following
paragraph:

************************************
Postfix version 2.5 and later:

    In master.cf set up a dedicated clone of the "smtp" transport for the 
destination in question. In the example below we will call it "fragile".

    In master.cf configure a reasonable process limit for the cloned smtp 
transport (a number in the 10-20 range is typical).

    IMPORTANT!!! In main.cf configure a large per-destination pseudo-cohort 
failure limit for the cloned smtp transport.

    /etc/postfix/main.cf:
        transport_maps = hash:/etc/postfix/transport
        fragile_destination_concurrency_failed_cohort_limit = 100
        fragile_destination_concurrency_limit = 20

    /etc/postfix/transport:
        example.com  fragile:

    /etc/postfix/master.cf:
        # service type  private unpriv  chroot  wakeup  maxproc command
        fragile   unix     -       -       n       -      20    smtp

    See also the documentation for 
default_destination_concurrency_failed_cohort_limit and 
default_destination_concurrency_limit
*******************************************************************************************************

Can we divide the destination domains into three transport groups and create
two clones of the "smtp" transport? We tested this configuration on our test
server, and it seems that extra smtp processes are created for both the
"buckeye" transport and the "famous-ISP" transport, although the default smtp
processes are still created. Will this change further reduce latency and
provide more concurrency so that throughput increases?

Group1 - buckeyemail.osu.edu uses "buckeye" transport
Group2 - gmail.com, yahoo.com and Hotmail.com use "famous-ISP" transport
Group3 - other domains use default "smtp" transport

    /etc/postfix/main.cf:
        transport_maps = hash:/etc/postfix/transport
        buckeye_destination_concurrency_failed_cohort_limit = 100
        buckeye_destination_concurrency_limit = 30
        famous-ISP_destination_concurrency_failed_cohort_limit = 100
        famous-ISP_destination_concurrency_limit = 20

    /etc/postfix/transport:
        Buckeyemail.osu.edu  buckeye:mail.us.messaging.microsoft.com
        Gmail.com famous-ISP:mail.us.messaging.microsoft.com
        Yahoo.com famous-ISP:mail.us.messaging.microsoft.com
        Hotmail.com famous-ISP:mail.us.messaging.microsoft.com

    /etc/postfix/master.cf:
        # service type  private unpriv  chroot  wakeup  maxproc command
        buckeye   unix     -       -       n       -      30    smtp
                            -o smtp_connect_timeout=5
        famous-ISP unix     -       -       n       -      20    smtp

Thanks and good night,

Carl

-----Original Message-----
From: owner-postfix-us...@postfix.org [mailto:owner-postfix-us...@postfix.org] 
On Behalf Of Viktor Dukhovni
Sent: Monday, April 28, 2014 7:45 PM
To: postfix-users@postfix.org
Subject: Re: Backlog to outsourced email provider

On Mon, Apr 28, 2014 at 11:05:56PM +0000, Xie, Wei wrote:

> header_checks = regexp:/etc/postfix/header_checks
> relayhost = mail.us.messaging.microsoft.com

This is effectively a miniature transport entry:

    relay_transport = relay:mail.us.messaging.microsoft.com
    default_transport = relay:mail.us.messaging.microsoft.com

Don't know whether the vendor intends for you to do MX lookups here or not 
(you're doing MX lookups).  The MX record just returns the original hostname.

    $ dig +noall +ans -t mx mail.us.messaging.microsoft.com
    mail.us.messaging.microsoft.com. IN MX 10 mail.us.messaging.microsoft.com.

    $ dig +noall +ans -t a mail.us.messaging.microsoft.com
    mail.us.messaging.microsoft.com. IN  A 216.32.181.178
    mail.us.messaging.microsoft.com. IN  A 216.32.180.22

> smtp_tls_CAfile = /etc/postfix/service_certs/osu_ues/DigiCertCA.crt
> smtp_tls_loglevel = 2
> smtpd_tls_loglevel = 2

You're killing your syslog daemon with debug logging.  Why is the TLS loglevel 
set to 2?  Have you looked at your logs?  They are full of debugging noise and 
likely severely limit performance.
For normal operation set the log level to 1.  Also make sure your syslogd is 
not doing synchronous logging of each log entry.

> smtp_tls_note_starttls_offer = yes

Futile, given:

> smtp_tls_security_level = encrypt

> Here are the settings for the following two parameters:
> 
> default_destination_concurrency_limit = 20

Fix your logging, then measure again.  A concurrency of 20 may be sufficient 
when the log level is sane.

> smtp_destination_concurrency_limit = $default_destination_concurrency_limit

This is redundant.

> >>Either increase concurrency or reduce latency.  Network captures may show 
> >>which protocol stage is responsible for most of the delay, even with TLS 
> >>one can tell whether the delay is at the beginning or at the end of the 
> >>TLS session or just low bandwidth throughout.
> 
> We prefer to increase concurrency.

The vendor might limit your concurrency, don't do that quite yet.

> >>How is the relay specified with or without surrounding "[]"?
> 
> Without surrounding "[]".
> 
> relayhost = mail.us.messaging.microsoft.com

Ask the vendor whether they want you to use MX indirection or not.

> On this RHEL 5.10 server, today 10:30:00 ~ 10:59:59 the output rate of 
> email to this domain in the 30 minutes was 10,928.
> 
> On other 6 RHEL 6.4 servers, today 10:30:00 ~ 10:59:59 the output rate 
> of email to this domain in the 30 minutes were 4,824 ~ 6,564.

You're comparing apples and oranges, the RHEL 6 hosts don't receive nearly 
enough traffic to be congested, they would perhaps be equally congested under 
the same load.  However, they may have sensibly configured logging with TLS 
loglevel 1, and/or no synchronous log writes.

> Today 10:00:00 ~ 10:29:59 the output rate of email to this relay in 
> the 30 minutes was 9,623.
> 
> Today 10:30:00 ~ 10:59:59 the output rate of email to this relay in 
> the 30 minutes was 10,928.

That's more like it: Throughput * Latency = Concurrency

    10928 / 1800 * 2.8 = 17.0

So with latencies around 2.8 seconds your estimated concurrency is
~17, which is close enough to 20.  The problem is either that your syslogd is 
overwhelmed and too slow, or the vendor service is too slow.
Fix the first problem first.

> Today 11:00:00 ~ 11:29:59 the output rate of email to this relay in 
> the 30 minutes was 15,597.

    15597 / 1800 * 2.8 = 24.3

So the latency number from that one message is likely a bit above average.  
Understand and memorize this simple formula:

        Throughput = Concurrency / Latency
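
The same relation, rearranged as Concurrency = Throughput * Latency (Little's law), can be applied to the two Monday intervals quoted above. This is a rough sketch: it uses the single 2.8 s latency figure from this thread as if it were the average, and the function name is made up for illustration.

```python
# Concurrency = Throughput * Latency, with throughput in deliveries/second.

LATENCY_S = 2.8  # assumption: the one measured latency quoted in this thread
WINDOW_S = 1800  # 30-minute reporting window


def estimated_concurrency(deliveries, window_s=WINDOW_S, latency_s=LATENCY_S):
    """Average number of deliveries in flight during the window."""
    return deliveries / window_s * latency_s


for label, deliveries in (("10:30-10:59", 10928), ("11:00-11:29", 15597)):
    print(label, round(estimated_concurrency(deliveries), 1))
```

An estimate above the configured limit of 20 cannot be real concurrency, so it would suggest that 2.8 s overstates the average latency for that interval.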

Fix your logging settings in main.cf and make sure that you follow the advice 
at the bottom of:

    http://www.postfix.org/LINUX_README.html

        Syslogd performance

        LINUX syslogd uses synchronous writes by default. Because of
        this, syslogd can actually use more system resources than
        Postfix. To avoid such badness, disable synchronous mail logfile
        writes by editing /etc/syslog.conf and by prepending a "-" to
        the logfile name:

            /etc/syslog.conf:
                mail.*                          -/var/log/mail.log

        Send a "kill -HUP" to the syslogd to make the change effective.
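
A quick way to spot rules that still log mail synchronously is to scan the file for mail rules whose file name lacks the leading "-". The following is a minimal sketch for the classic syslog.conf syntax shown above. The sample contents and function names are hypothetical, and it only looks at rules that name the mail facility explicitly (a wildcard rule such as "*.info" is not examined).

```python
def selects_mail(selector):
    """True if a selector names the mail facility with a priority other than none."""
    for part in selector.split(";"):
        facilities, _, priority = part.partition(".")
        if "mail" in facilities.split(",") and priority != "none":
            return True
    return False


def synchronous_mail_rules(syslog_conf_text):
    """Return mail rules that write to a file without the "-" prefix."""
    flagged = []
    for raw in syslog_conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # not a "selector action" rule
        selector, action = parts
        # A file action starting with "/" is written synchronously;
        # "-/path" is the form recommended in the README text above.
        if selects_mail(selector) and action.startswith("/"):
            flagged.append(line)
    return flagged


# Hypothetical file contents, for illustration only.
sample = """\
# sample syslog.conf
*.info;mail.none;authpriv.none  /var/log/messages
mail.*                          /var/log/maillog
mail.err                        -/var/log/mail.err
"""

for rule in synchronous_mail_rules(sample):
    print("synchronous:", rule)
```

To check a real system, the function could be given the text of the actual configuration file instead of the sample string.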

-- 
        Viktor.
