RE: Backlog to outsourced email provider

Xie, Wei Mon, 28 Apr 2014 11:59:12 -0700

Viktor,

> Today 11:00:00 ~ 11:29:59 the output rate of email to this domain in the 30 
> minutes was 111,27.

That is a typo. It should be 11,127.

I will carefully read the rest of your email and digest it.

Thanks,

Carl

-----Original Message-----
From: owner-postfix-us...@postfix.org [mailto:owner-postfix-us...@postfix.org] 
On Behalf Of Viktor Dukhovni
Sent: Monday, April 28, 2014 2:42 PM
To: postfix-users@postfix.org
Subject: Re: Backlog to outsourced email provider

On Mon, Apr 28, 2014 at 06:09:43PM +0000, Xie, Wei wrote:

> When congestion occurred, all other messages to the same domain were 
> similarly delayed. The delays grew longer and longer (the longest 
> exceeded 1200 seconds), and so did the active queue: according to the 
> output of 'qshape active' and 'mailq |grep \* |wc -l', over 9,000 
> messages were queued.

Clearly the output rate is not keeping up with the input rate.

> >>> What is the complete set of logs for this queue-id?  
> 
> Apr 28 10:47:11 cio-krc-pf03 postfix/smtpd[31853]: 9934181190: 
> client=cio-tnc-ht06.osuad.osu.edu[164.107.81.171]
> Apr 28 10:47:11 cio-krc-pf03 postfix/qmgr[31812]: 9934181190: 
> from=<erequest.do.not.re...@osu.edu>, size=1905, nrcpt=1 (queue active)
> Apr 28 11:03:18 cio-krc-pf03 postfix/smtp[5015]: 9934181190: 
> to=<turek...@buckeyemail.osu.edu>, 
> relay=mail.us.messaging.microsoft.com[216.32.181.178]:25, delay=967, 
> delays=0/964/1.5/1.3, dsn=2.6.0, status=sent (250 2.6.0 
> <27520027.166481398696426625.javamail.erequest.do.not.re...@osu.edu> 
> [InternalId=9787221] Queued mail for delivery)
> Apr 28 11:03:18 cio-krc-pf03 postfix/qmgr[31812]: 9934181190: removed

Yes, indeed: the message sat in the queue for 964 seconds with nothing 
happening.

> >>>What was the output rate of email to this domain in the 30 minutes 
> >>>preceding this log entry?  (Avoid counting multiple recipients 
> >>>with the same queue-id, relay and remote server response as 
> >>>separate deliveries).
> 
> Today 10:00:00 ~ 10:29:59 the output rate of email to this domain in the 30 
> minutes was 15,361.
> Today 10:30:00 ~ 10:59:59 the output rate of email to this domain in the 30 
> minutes was 28,827.
> Today 11:00:00 ~ 11:29:59 the output rate of email to this domain in the 30 
> minutes was 111,27.

Can you clarify that last one?  Is that ~11 thousand?  The comma seems 
misplaced.  The earlier rate appears to be ~100 messages per minute, or just 
over 1 per second.  You should measure the average "c+d" in the log for these 
time frames, again counting multiple recipients in a single delivery as one 
event.
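
To illustrate the measurement described above, here is a minimal sketch (not 
a definitive tool) that counts deliveries and averages "c+d" from Postfix 
smtp log lines.  It assumes each log entry is on a single line in the format 
shown in the quoted log, and that a "delivery" is identified by the tuple 
(queue-id, relay, remote response).  The names LINE_RE and delivery_stats are 
illustrative only.

```python
import re

# Assumed line shape, based on the smtp log entry quoted in this thread.
LINE_RE = re.compile(
    r"postfix/smtp\[\d+\]: (?P<qid>[0-9A-F]+): to=<[^>]*>, "
    r"relay=(?P<relay>[^,]+), .*?delays=(?P<delays>[\d./]+), "
    r".*?status=sent \((?P<resp>.*)\)"
)


def delivery_stats(lines):
    """Return (number of deliveries, mean c+d in seconds).

    Recipients that share queue-id, relay and remote response are
    counted as one delivery event.
    """
    seen = {}
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # not a successful smtp delivery line
        key = (m["qid"], m["relay"], m["resp"])
        # delays=a/b/c/d; only c (connection setup) and d (transmission)
        # are summed here.
        _a, _b, c, d = (float(x) for x in m["delays"].split("/"))
        seen.setdefault(key, c + d)
    if not seen:
        return 0, 0.0
    return len(seen), sum(seen.values()) / len(seen)
```

Feeding it only the lines whose timestamps fall in a given 30-minute window 
gives the delivery count and typical latency for that window.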

Supposing the 2.8 second delivery latency to be typical, a delivery rate of 
1-2 messages per second suggests a destination concurrency limit of "2", 
rather than the default limit of 20.
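
As a rough sanity check on this reasoning: in steady state, throughput is 
bounded by concurrency divided by per-delivery latency (Little's law).  A 
minimal sketch, assuming the 2.8 second figure is the "c+d" sum from the log 
entry above and ignoring connection reuse and rate delays; the function names 
are illustrative only.

```python
def max_delivery_rate(concurrency, latency_seconds):
    """Upper bound on deliveries per second for a given concurrency."""
    return concurrency / latency_seconds


def required_concurrency(rate_per_second, latency_seconds):
    """Concurrency needed to sustain a given delivery rate."""
    return rate_per_second * latency_seconds


if __name__ == "__main__":
    latency = 2.8  # seconds, c+d from the quoted log entry
    for limit in (2, 20):
        rate = max_delivery_rate(limit, latency)
        print(f"limit={limit}: at most {rate:.2f} deliveries/second")
```

If the measured output rate sits far below the bound for a limit of 20, the 
effective concurrency is probably lower than the default.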

You need to post "postconf -n" output or at least:

        default_destination_concurrency_limit
        smtp_destination_concurrency_limit

Check which transport is used for this domain and, if it is not "smtp", post 
the concurrency limit for that transport.

> >>> Have you configured any concurrency controls or rate delay for this 
> >>> destination?
> 
> No, the defaults are unchanged.  Which parameters for concurrency controls 
> or rate delay need to be checked?

    All <transport>_destination_concurrency_limit settings in main.cf

    Any <transport>_destination_rate_delay settings in main.cf
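
One way to pull just those settings out of saved "postconf -n" output is 
sketched below.  This is an illustration, not part of Postfix: the function 
name and the sample values in the usage note are hypothetical.  Note that 
"postconf -n" only lists parameters set explicitly in main.cf, so an empty 
result means the defaults apply.

```python
import re

# Matches e.g. smtp_destination_concurrency_limit or
# relay_destination_rate_delay, followed by "= value".
SETTING_RE = re.compile(
    r"^(\w+_destination_(?:concurrency_limit|rate_delay))\s*=\s*(.*)$"
)


def throttle_settings(postconf_output):
    """Return {parameter: value} for concurrency and rate-delay settings."""
    found = {}
    for line in postconf_output.splitlines():
        m = SETTING_RE.match(line.strip())
        if m:
            found[m.group(1)] = m.group(2).strip()
    return found
```

For example, given text containing the hypothetical line 
"smtp_destination_concurrency_limit = 2", the function returns a dictionary 
with that single entry.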

> >>>This delay means a large number of messages waiting behind the messages 
> >>>currently being delivered, subject to concurrency and rate delays.
> 
> How can we increase delivery rate so that b-delay is down?

Either increase concurrency or reduce latency.  Network captures may show 
which protocol stage is responsible for most of the delay.  Even with TLS, one 
can tell whether the delay is at the beginning or at the end of the TLS 
session, or whether bandwidth is simply low throughout.

> How can we check the new server has a working local DNS cache?
> Check the file /etc/resolv.conf?

Yes, but also time MX, A and AAAA lookups for the destination relay.
How is the relay specified: with or without surrounding "[]"?

> In peak hours 10:30:00~ 10:59:59, other servers running Postfix-2.6.6 
> on RHEL 6.4 were fine.

That's meaningless.  What was their output rate?  What was their input rate?  
What was the typical "c+d" latency?  If you want help with performance 
problems you need to start gathering and crunching data; being lazy and 
avoiding hard numbers is not an option.

> Only this server running Postfix-2.6.6 on RHEL 5.10 experienced 
> serious delay.

Delay happens when the input rate exceeds the output rate.

> Do we need change some parameters to increase delivery rate or set 
> special channel/allocate fixed SMTP processes for specified outbound 
> domains?

Random parameter twiddling rarely solves congestion, but it can cause it.  
Before changing anything the reason for the congestion needs to be identified.  
The output rate looks anaemic to me.  Why is the output concurrency so low?

> Our outbound emails have four main destination domains to be relayed 
> to Windows FOPE.
> 
> Buckeyemail.osu.edu ---> mail.us.messaging.microsoft.com
> Gmail.com           ---> mail.us.messaging.microsoft.com
> Yahoo.com           ---> mail.us.messaging.microsoft.com
> Hotmail.com         ---> mail.us.messaging.microsoft.com

Did you measure the output rate for all mail destined to this relay, or just 
the first domain?  The correct measurement is to aggregate counts by transport 
next-hop.  Please report output rates for all these combined, or rather all 
mail with a relay of "mail.us.messaging.microsoft.com".
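
A minimal sketch of that aggregation, grouping successful deliveries by relay 
host regardless of recipient domain.  It assumes one log entry per line in 
the format quoted earlier in this thread; RELAY_RE and deliveries_per_relay 
are illustrative names, not an existing tool.

```python
import re
from collections import Counter

# Captures queue-id, relay host name (without [address]:port) and the
# remote server response from a successful smtp delivery line.
RELAY_RE = re.compile(
    r"postfix/smtp\[\d+\]: ([0-9A-F]+): .*?"
    r"relay=([^\[,]+)\[[^\]]*\]:\d+,.*?status=sent \((.*)\)"
)


def deliveries_per_relay(lines):
    """Count deliveries per relay host, one per (queue-id, relay, response)."""
    seen = set()
    for line in lines:
        m = RELAY_RE.search(line)
        if m:
            seen.add((m.group(1), m.group(2), m.group(3)))
    return Counter(relay for _qid, relay, _resp in seen)
```

Running it over the lines for one time window gives the combined output count 
for "mail.us.messaging.microsoft.com" across all four destination domains.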

-- 
        Viktor.
