RE: Backlog to outsourced email provider

Xie, Wei Mon, 28 Apr 2014 16:06:57 -0700

Viktor,

>>You need to post "postconf -n" output or at least:
>>
>>      default_destination_concurrency_limit
>>      smtp_destination_concurrency_limit
>>
>>check which transport is used for this domain, and if not "smtp", post the 
>>concurrency limit for that.


Here is the output of 'postconf -n':

alias_database = dbm:/etc/aliases
alias_maps = hash:/etc/aliases
command_directory = /usr/sbin
config_directory = /etc/postfix
daemon_directory = /usr/libexec/postfix
data_directory = /var/lib/postfix
debug_peer_level = 2
header_checks = regexp:/etc/postfix/header_checks
html_directory = no
inet_interfaces = $myhostname, localhost
inet_protocols = ipv4
mail_owner = postfix
mailbox_size_limit = 53000000
mailq_path = /usr/bin/mailq.postfix
manpage_directory = /usr/share/man
message_size_limit = 52428800
mydestination = $myhostname, localhost.$mydomain, localhost
mydomain = osuad.osu.edu
myhostname = cio-krc-pf03.osuad.osu.edu
newaliases_path = /usr/bin/newaliases.postfix
queue_directory = /var/spool/postfix
readme_directory = /usr/share/doc/postfix-2.6.14-documentation/readme
relayhost = mail.us.messaging.microsoft.com
sample_directory = /usr/share/doc/postfix-2.6.14-documentation/samples
sendmail_path = /usr/sbin/sendmail.postfix
setgid_group = postdrop
smtp_tls_CAfile = /etc/postfix/service_certs/osu_ues/DigiCertCA.crt
smtp_tls_loglevel = 2
smtp_tls_note_starttls_offer = yes
smtp_tls_security_level = encrypt
smtp_tls_session_cache_database = btree:/var/lib/postfix/smtp_scache
smtpd_tls_CAfile = /etc/postfix/service_certs/osu_ues/DigiCertCA.crt
smtpd_tls_CApath = /etc/postfix/service_certs/osu_ues
smtpd_tls_cert_file = /etc/postfix/service_certs/osu_ues/OSU_UES_WC_Cert_Ex_certificate.pem
smtpd_tls_key_file = /etc/postfix/service_certs/osu_ues/OSU_UES_WC_Cert_Ex_key.pem
smtpd_tls_loglevel = 2
smtpd_tls_received_header = yes
smtpd_tls_security_level = may
smtpd_tls_session_cache_database = btree:/var/lib/postfix/smtpd_scache
unknown_local_recipient_reject_code = 550

Here are the settings for the following two parameters:

default_destination_concurrency_limit = 20
smtp_destination_concurrency_limit = $default_destination_concurrency_limit

No transport map is configured so far.

> >>> Have you configured any concurrency controls or rate delay for this 
> >>> destination?
> 
> No. keep default unchanged.  Which parameters for concurrency controls or 
> rate delay need to be checked?
>
>    All <transport>_destination_concurrency_limit settings in main.cf
>
>    Any <transport>_destination_rate_delay settings in main.cf

No transport-specific parameters are defined in main.cf.

>>Either increase concurrency or reduce latency.  Network captures may show
>>which protocol stage is responsible for most of the delay, even with TLS one
>>can tell whether the delay is at the beginning or at the end of the TLS
>>session or just low bandwidth throughout.

We prefer to increase concurrency.


>>> How can we check the new server has a working local DNS cache?
>>> Check the file /etc/resolv.conf?
>>
>>Yes, but also time MX, A and AAAA lookups for the destination relay.

# time dig MX mail.us.messaging.microsoft.com
real    0m0.014s
user    0m0.002s
sys     0m0.003s

# time dig A mail.us.messaging.microsoft.com
real    0m0.016s
user    0m0.004s
sys     0m0.004s

# time dig AAAA mail.us.messaging.microsoft.com
real    0m0.058s
user    0m0.001s
sys     0m0.004s
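
To make these timings repeatable, here is a rough sketch that times several consecutive lookups, so that a working local cache should show up as a slower first lookup followed by fast repeats. It relies on Python's socket.getaddrinfo, which covers A and AAAA lookups but not MX (the standard library has no MX query, so 'dig MX' is still needed for that). The name time_lookups and its resolver parameter are made up for this illustration.

```python
import socket
import time


def time_lookups(host, family=socket.AF_INET, repeats=3,
                 resolver=socket.getaddrinfo):
    """Return the elapsed seconds of `repeats` consecutive lookups of `host`."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        try:
            # Port 25 and SOCK_STREAM mirror what an SMTP client would ask for.
            resolver(host, 25, family, socket.SOCK_STREAM)
        except socket.gaierror:
            # A failed lookup (for example no AAAA record) still costs time,
            # so it is timed like any other.
            pass
        timings.append(time.perf_counter() - start)
    return timings
```

Calling time_lookups("mail.us.messaging.microsoft.com") would time A lookups for the relay; passing family=socket.AF_INET6 would time the AAAA side.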

>>How is the relay specified with or without surrounding "[]"?

Without surrounding "[]".

relayhost = mail.us.messaging.microsoft.com
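
For reference, the difference the brackets make can be sketched as follows. Per the Postfix documentation, a bare relayhost name is subject to an MX lookup, while the "[name]" form turns the MX lookup off. The helper parse_relayhost below is hypothetical and only illustrates the two forms; it is not part of Postfix and does not handle bare IPv6 literals.

```python
def parse_relayhost(spec):
    """Split a relayhost value into (host, port, mx_lookup).

    mx_lookup is False for the bracketed forms "[host]" and "[host]:port",
    which tell Postfix to skip the MX lookup for that name.
    """
    spec = spec.strip()
    if spec.startswith("["):
        host, _, rest = spec[1:].partition("]")
        port = rest[1:] if rest.startswith(":") else None
        return host, port, False
    host, sep, tail = spec.rpartition(":")
    if sep:
        return host, tail, True
    return spec, None, True


print(parse_relayhost("mail.us.messaging.microsoft.com"))
print(parse_relayhost("[mail.us.messaging.microsoft.com]:25"))
```

With the setting as posted, the third element is True, meaning MX lookups for the relay name are in play, which is why the MX timing above matters.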

>> In peak hours 10:30:00~ 10:59:59, other servers running Postfix-2.6.6 
>> on RHEL 6.4 were fine.
>
>That's meaningless, what was their output rate?  What was their input rate?
>What was the typical "c+d" latency?  If you want help with performance
>problems you need to start gathering and crunching data; being lazy and
>avoiding hard numbers is not an option.

You are totally correct.

On this RHEL 5.10 server, the output of email to this domain in the 30 minutes
from 10:30:00 to 10:59:59 today was 10,928 messages.

On the other six RHEL 6.4 servers, the output of email to this domain in the
same 30 minutes ranged from 4,824 to 6,564 messages.

>> Our outbound emails have four main destination domains to be relayed 
>> to Windows FOPE.
>> 
>> Buckeyemail.osu.edu ---> mail.us.messaging.microsoft.com
>> Gmail.com           ---> mail.us.messaging.microsoft.com
>> Yahoo.com           ---> mail.us.messaging.microsoft.com
>> Hotmail.com         ---> mail.us.messaging.microsoft.com

>Did you measure the output rate for all mail destined to this relay, or just
>the first domain?  The correct measurement is to aggregate counts by transport
>next-hop.  Please report output rates for all these combined, or rather all
>mail with a relay of "mail.us.messaging.microsoft.com".

We measured the output rate for all mail destined to this relay. I used your
criteria to double-check and got the following output rates. The data I gave
you in my previous email was not accurate.

Today 10:00:00 ~ 10:29:59 the output rate of email to this relay in the 30
minutes was 9,623.

Today 10:30:00 ~ 10:59:59 the output rate of email to this relay in the 30
minutes was 10,928, broken down as follows:

    to domain "buckeyemail.osu.edu" via this relay: 6,399
    to domain "gmail.com" via this relay:           2,803
    to domain "yahoo.com" via this relay:             619
    to domain "Hotmail.com" via this relay:           336
    to other domains via this relay:                  771

Today 11:00:00 ~ 11:29:59 the output rate of email to this relay in the 30
minutes was 15,597.
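
For easier comparison with per-delivery latencies, these 30-minute counts can be converted to messages per second. The snippet below is only arithmetic on the figures reported above; the variable and function names are chosen for illustration.

```python
WINDOW_SECONDS = 30 * 60

# Deliveries via the relay per 30-minute window, as reported above.
windows = {
    "10:00-10:29": 9623,
    "10:30-10:59": 10928,
    "11:00-11:29": 15597,
}

# Per-domain breakdown of the 10:30-10:59 window, as reported above.
by_domain = {
    "buckeyemail.osu.edu": 6399,
    "gmail.com": 2803,
    "yahoo.com": 619,
    "hotmail.com": 336,
    "other": 771,
}


def per_second(count, seconds=WINDOW_SECONDS):
    """Convert a count over a window into messages per second."""
    return count / seconds


rates = {label: round(per_second(n), 2) for label, n in windows.items()}
total = sum(by_domain.values())
shares = {d: round(100 * n / total, 1) for d, n in by_domain.items()}

# Sanity check: the breakdown adds up to the window total.
assert total == windows["10:30-10:59"]

print(rates)
print(shares)
```

This works out to roughly 5.3, 6.1 and 8.7 messages per second for the three windows, with buckeyemail.osu.edu accounting for a bit under 60% of the 10:30 window.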

Thanks,

Carl

-----Original Message-----
From: owner-postfix-us...@postfix.org [mailto:owner-postfix-us...@postfix.org] 
On Behalf Of Viktor Dukhovni
Sent: Monday, April 28, 2014 2:42 PM
To: postfix-users@postfix.org
Subject: Re: Backlog to outsourced email provider

On Mon, Apr 28, 2014 at 06:09:43PM +0000, Xie, Wei wrote:

> When congestion occurred, all other messages to the same domain were
> similarly delayed. The delays grew longer and longer (the longest
> exceeded 1200 seconds) and the active queue grew longer and
> longer (according to the output of 'qshape active' and
> 'mailq | grep \* | wc -l', over 9,000 messages were queued).

Clearly the output rate is not keeping up with the input rate.
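
As a back-of-the-envelope consistency check (a sketch, not a measurement): if the queue drains roughly first-in, first-out, a newly arrived message waits about the backlog divided by the output rate. The backlog of 9,000 is the mailq count quoted above; the output figure of 10,928 deliveries per 30 minutes is the one reported elsewhere in this thread for the 10:30 window. The function name is illustrative.

```python
def estimated_wait_seconds(backlog, delivered, window_seconds):
    """Approximate queueing delay: messages ahead divided by the drain rate."""
    rate = delivered / window_seconds  # messages per second
    return backlog / rate


wait = estimated_wait_seconds(backlog=9000, delivered=10928, window_seconds=1800)
print(round(wait), "seconds")
```

That comes to about 1,480 seconds, the same order of magnitude as the delays of 960 to 1,200 seconds seen in the logs, which fits the picture of messages simply waiting their turn.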

> >>> What is the complete set of logs for this queue-id?  
> 
> Apr 28 10:47:11 cio-krc-pf03 postfix/smtpd[31853]: 9934181190:
>   client=cio-tnc-ht06.osuad.osu.edu[164.107.81.171]
> Apr 28 10:47:11 cio-krc-pf03 postfix/qmgr[31812]: 9934181190:
>   from=<erequest.do.not.re...@osu.edu>, size=1905, nrcpt=1 (queue active)
> Apr 28 11:03:18 cio-krc-pf03 postfix/smtp[5015]: 9934181190:
>   to=<turek...@buckeyemail.osu.edu>,
>   relay=mail.us.messaging.microsoft.com[216.32.181.178]:25, delay=967,
>   delays=0/964/1.5/1.3, dsn=2.6.0, status=sent (250 2.6.0
>   <27520027.166481398696426625.javamail.erequest.do.not.re...@osu.edu>
>   [InternalId=9787221] Queued mail for delivery)
> Apr 28 11:03:18 cio-krc-pf03 postfix/qmgr[31812]: 9934181190: removed

Yes indeed nothing seems to happen for 964 seconds sitting in the queue.
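
For readers less familiar with the field: "delays=a/b/c/d" splits the total delay into a (time before the queue manager, including message transmission), b (time in the queue manager), c (connection setup, including DNS, HELO and TLS) and d (message transmission time). A small sketch that pulls the four values out of a log line follows; parse_delays is an illustrative helper, not a Postfix tool.

```python
import re

DELAYS_RE = re.compile(r"delays=([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+)")


def parse_delays(line):
    """Return the four delays= components of a delivery log line, or None."""
    m = DELAYS_RE.search(line)
    if m is None:
        return None
    a, b, c, d = (float(x) for x in m.groups())
    return {"before_qmgr": a, "in_qmgr": b, "conn_setup": c, "transmission": d}


line = ("relay=mail.us.messaging.microsoft.com[216.32.181.178]:25, delay=967, "
        "delays=0/964/1.5/1.3, dsn=2.6.0, status=sent")
print(parse_delays(line))
```

For the entry quoted above this gives 964 seconds in the queue manager against 1.5 + 1.3 = 2.8 seconds of actual delivery work.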

> >>>What was the output rate of email to this domain in the 30 minutes
> >>>preceding this log entry?  (Avoid counting multiple recipients
> >>>with the same queue-id, relay and remote server response as
> >>>separate deliveries).
> 
> Today 10:00:00 ~ 10:29:59 the output rate of email to this domain in the 30 
> minutes was 15,361.
> Today 10:30:00 ~ 10:59:59 the output rate of email to this domain in the 30 
> minutes was 28,827.
> Today 11:00:00 ~ 11:29:59 the output rate of email to this domain in the 30 
> minutes was 111,27.

Can you clarify that last one?  Is that ~11 thousand?  The comma seems
misplaced.  The earlier rate appears to be ~100 messages per minute, or just
over 1 per second.  You should measure the average "c+d" in the log for these
time frames, again counting multiple recipients in a single delivery as one
event.
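
One possible way to take that measurement, sketched under the assumption that the delivery log lines look like the smtp entry quoted above (the regular expression and the function name are illustrative and may need adjusting for other log layouts):

```python
import re

SENT_RE = re.compile(
    r"postfix/smtp\[\d+\]: (?P<qid>\w+): to=<[^>]*>, "
    r"(?:orig_to=<[^>]*>, )?relay=(?P<relay>[^,]+), "
    r".*?delays=[\d.]+/[\d.]+/(?P<c>[\d.]+)/(?P<d>[\d.]+), "
    r".*?status=sent \((?P<resp>.*)\)"
)


def average_c_plus_d(lines):
    """Mean c+d over deliveries, counting each (queue-id, relay, response) once."""
    seen = {}
    for line in lines:
        m = SENT_RE.search(line)
        if m:
            key = (m["qid"], m["relay"], m["resp"])
            # Several recipients delivered in one transaction share the key,
            # so only the first occurrence is recorded.
            seen.setdefault(key, float(m["c"]) + float(m["d"]))
    return sum(seen.values()) / len(seen) if seen else None
```

Fed the lines from the mail log for one time window, it returns the average c+d in seconds, or None if no matching deliveries were found.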

Supposing the 2.8 second delivery latency to be typical, a delivery rate of 1-2
messages per second suggests a destination concurrency limit of "2", rather
than the default limit of 20.
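
The reasoning here is essentially Little's law: the average number of deliveries in flight is roughly the delivery rate times the per-delivery latency. A minimal sketch, assuming the 2.8 seconds (c+d) from the single log entry above is typical, which is exactly the assumption that still needs to be checked against the logs:

```python
def implied_concurrency(rate_per_sec, latency_sec):
    """Little's law: average deliveries in flight = throughput x latency."""
    return rate_per_sec * latency_sec


def max_rate(concurrency_limit, latency_sec):
    """Upper bound on deliveries per second at a given concurrency limit."""
    return concurrency_limit / latency_sec


latency = 2.8  # c + d = 1.5 + 1.3 from the log entry above (assumed typical)
for rate in (1, 2):
    in_flight = implied_concurrency(rate, latency)
    print(rate, "msg/s ->", round(in_flight, 1), "in flight")
print("limit 20 ->", round(max_rate(20, latency), 1), "msg/s at most")
```

At 1-2 messages per second this gives only a handful of deliveries in flight, well short of the default limit of 20; with the limit fully used, the same latency would allow about 7 messages per second.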

You need to post "postconf -n" output or at least:

        default_destination_concurrency_limit
        smtp_destination_concurrency_limit

check which transport is used for this domain, and if not "smtp", post the 
concurrency limit for that.

> >>> Have you configured any concurrency controls or rate delay for this 
> >>> destination?
> 
> No. keep default unchanged.  Which parameters for concurrency controls or 
> rate delay need to be checked?

    All <transport>_destination_concurrency_limit settings in main.cf

    Any <transport>_destination_rate_delay settings in main.cf

> >>>This delay means a large number of messages waiting behind the messages 
> >>>currently being delivered, subject to concurrency and rate delays.
> 
> How can we increase delivery rate so that b-delay is down?

Either increase concurrency or reduce latency.  Network captures may show which 
protocol stage is responsible for most of the delay, even with TLS one can tell 
whether the delay is at the beginning or at the end of the TLS session or just 
low bandwidth throughout.

> How can we check the new server has a working local DNS cache?
> Check the file /etc/resolv.conf?

Yes, but also time MX, A and AAAA lookups for the destination relay.
How is the relay specified with or without surrounding "[]"?

> In peak hours 10:30:00~ 10:59:59, other servers running Postfix-2.6.6 
> on RHEL 6.4 were fine.

That's meaningless, what was their output rate?  What was their input rate?
What was the typical "c+d" latency?  If you want help with performance problems
you need to start gathering and crunching data; being lazy and avoiding hard
numbers is not an option.

> Only this server running Postfix-2.6.6 on RHEL 5.10 experienced 
> serious delay.

Delay happens when the input rate exceeds the output rate.

> Do we need change some parameters to increase delivery rate or set 
> special channel/allocate fixed SMTP processes for specified outbound 
> domains?

Random parameter twiddling rarely solves congestion, but it can cause it.  
Before changing anything the reason for the congestion needs to be identified.  
The output rate looks anaemic to me, why is the output concurrency so low?

> Our outbound emails have four main destination domains to be relayed 
> to Windows FOPE.
> 
> Buckeyemail.osu.edu ---> mail.us.messaging.microsoft.com
> Gmail.com           ---> mail.us.messaging.microsoft.com
> Yahoo.com           ---> mail.us.messaging.microsoft.com
> Hotmail.com         ---> mail.us.messaging.microsoft.com

Did you measure the output rate for all mail destined to this relay, or just 
the first domain?  The correct measurement is to aggregate counts by transport 
next-hop.  Please report output rates for all these combined, or rather all 
mail with a relay of "mail.us.messaging.microsoft.com".
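
A sketch of that aggregation, using the relay hostname in each "status=sent" log line as a stand-in for the next-hop and counting each (queue-id, relay, response) combination once. The regular expression assumes log lines shaped like the smtp entry quoted earlier, and deliveries_by_relay is an illustrative name:

```python
import re
from collections import Counter

RELAY_RE = re.compile(
    r": (?P<qid>\w+): to=<[^>]*>,.*? relay=(?P<host>[^\[,]+)\[[^\]]*\](?::\d+)?,"
    r".*?status=sent \((?P<resp>.*)\)"
)


def deliveries_by_relay(lines):
    """Count sent deliveries per relay hostname, one per (queue-id, relay, response)."""
    seen = set()
    for line in lines:
        m = RELAY_RE.search(line)
        if m:
            # The IP address in brackets is dropped on purpose, so deliveries
            # to different addresses of the same relay are counted together.
            seen.add((m["qid"], m["host"], m["resp"]))
    return Counter(host for _, host, _ in seen)
```

Restricting the input to the lines of one 30-minute window and reading the count for "mail.us.messaging.microsoft.com" gives the combined output for that window, regardless of recipient domain.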

-- 
        Viktor.
