On Mon, Apr 28, 2014 at 06:09:43PM +0000, Xie, Wei wrote:

> When congestion occurred, all other messages to the same domain
> were similarly delayed. The delay are longer and longer (the
> longest exceeded 1200 seconds) and the length of active queue is
> longer and longer (get to know from the outputs from commands
> 'qshape active' and 'mailq |grep \* |wc -l, the queued messages
> were over 9,000).

Clearly the output rate is not keeping up with the input rate.

> >>> What is the complete set of logs for this queue-id?
>
> Apr 28 10:47:11 cio-krc-pf03 postfix/smtpd[31853]: 9934181190:
>   client=cio-tnc-ht06.osuad.osu.edu[164.107.81.171]
> Apr 28 10:47:11 cio-krc-pf03 postfix/qmgr[31812]: 9934181190:
>   from=<erequest.do.not.re...@osu.edu>, size=1905, nrcpt=1 (queue active)
> Apr 28 11:03:18 cio-krc-pf03 postfix/smtp[5015]: 9934181190:
>   to=<turek...@buckeyemail.osu.edu>,
>   relay=mail.us.messaging.microsoft.com[216.32.181.178]:25, delay=967,
>   delays=0/964/1.5/1.3, dsn=2.6.0, status=sent (250 2.6.0
>   <27520027.166481398696426625.javamail.erequest.do.not.re...@osu.edu>
>   [InternalId=9787221] Queued mail for delivery)
> Apr 28 11:03:18 cio-krc-pf03 postfix/qmgr[31812]: 9934181190: removed

Yes, indeed, nothing seems to happen during the 964 seconds the
message spends sitting in the active queue (the second, "b", field
of "delays=").

> >>>What was the output rate of email to this domain in the 30
> >>> minutes preceeding this log entry? (Avoid counting multiple
> >>> recipients with the same queue-id, relay and remote server
> >>> response as separate deliveries).
>
> Today 10:00:00 ~ 10:29:59 the output rate of email to this domain in the 30
> minutes was 15,361.
> Today 10:30:00 ~ 10:59:59 the output rate of email to this domain in the 30
> minutes was 28,827.
> Today 11:00:00 ~ 11:29:59 the output rate of email to this domain in the 30
> minutes was 111,27.

Can you clarify that last one? Is that ~11 thousand? The comma seems
misplaced. The earlier rate appears to be ~100 messages per minute,
or just over 1 per second.

You should measure the average "c+d" in the log for these time
frames, again counting multiple recipients in a single delivery as
one event. Supposing the 2.8 second delivery latency (the "c+d" of
1.5 + 1.3 seconds in the log entry above) to be typical, a delivery
rate of 1-2 messages per second suggests a destination concurrency
limit of "2", rather than the default limit of 20.

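The measurement asked for above can be sketched roughly as follows.
This is only an illustration, not a vetted tool: the regular
expression, the function names (summarize, implied_concurrency) and
the second and third sample log lines are my own assumptions; only
the first sample line is taken from the log posted above. It counts
each (queue-id, relay, response) combination once, averages "c+d"
per relay, and multiplies rate by latency to estimate how many
deliveries are in flight at once.

```python
import re
from collections import defaultdict

# Assumed pattern for "status=sent" lines logged by the Postfix smtp client.
DELIVERY_RE = re.compile(
    r"postfix/smtp\[\d+\]: (?P<qid>[0-9A-F]+): to=<[^>]*>, "
    r".*?relay=(?P<relay>[^,\[]+)\[[^\]]*\](?::\d+)?, "
    r".*?delays=(?P<a>[\d.]+)/(?P<b>[\d.]+)/(?P<c>[\d.]+)/(?P<d>[\d.]+), "
    r".*?status=sent \((?P<response>.*)\)"
)


def summarize(lines):
    """Return {relay: (deliveries, mean c+d in seconds)}.

    Recipients sharing a queue-id, relay and server response are
    counted as a single delivery.
    """
    seen = set()
    totals = defaultdict(lambda: [0, 0.0])  # relay -> [count, sum of c+d]
    for line in lines:
        m = DELIVERY_RE.search(line)
        if m is None:
            continue
        key = (m["qid"], m["relay"], m["response"])
        if key in seen:
            continue  # another recipient of the same delivery
        seen.add(key)
        entry = totals[m["relay"]]
        entry[0] += 1
        entry[1] += float(m["c"]) + float(m["d"])
    return {relay: (n, total / n) for relay, (n, total) in totals.items()}


def implied_concurrency(deliveries, window_seconds, mean_latency):
    """Average number of deliveries in flight: rate times latency."""
    return deliveries / window_seconds * mean_latency


RELAY = "relay=mail.us.messaging.microsoft.com[216.32.181.178]:25"
RESPONSE = (
    "250 2.6.0 "
    "<27520027.166481398696426625.javamail.erequest.do.not.re...@osu.edu> "
    "[InternalId=9787221] Queued mail for delivery"
)
SAMPLE = [
    # From the posted log, joined back into a single line.
    "Apr 28 11:03:18 cio-krc-pf03 postfix/smtp[5015]: 9934181190: "
    "to=<turek...@buckeyemail.osu.edu>, " + RELAY + ", delay=967, "
    "delays=0/964/1.5/1.3, dsn=2.6.0, status=sent (" + RESPONSE + ")",
    # Hypothetical: second recipient of the same delivery (not counted again).
    "Apr 28 11:03:18 cio-krc-pf03 postfix/smtp[5015]: 9934181190: "
    "to=<second@example.com>, " + RELAY + ", delay=967, "
    "delays=0/964/1.5/1.3, dsn=2.6.0, status=sent (" + RESPONSE + ")",
    # Hypothetical: a separate delivery to the same relay.
    "Apr 28 11:03:20 cio-krc-pf03 postfix/smtp[5016]: A1B2C3D4E5: "
    "to=<third@example.com>, " + RELAY + ", delay=903, "
    "delays=0/900/1.0/2.0, dsn=2.6.0, status=sent "
    "(250 2.6.0 <hypothetical@example.com> [InternalId=1] Queued mail for delivery)",
]

if __name__ == "__main__":
    for relay, (count, mean_cd) in summarize(SAMPLE).items():
        print(relay, count, round(mean_cd, 2))
```

For example, at 1 delivery per second and 2.8 seconds per delivery,
implied_concurrency(1800, 1800, 2.8) gives 2.8 deliveries in flight,
far below 20. Grouping by relay rather than by recipient domain
means all domains routed to the same next-hop are counted together.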
You need to post "postconf -n" output or at least:

    default_destination_concurrency_limit
    smtp_destination_concurrency_limit

Check which transport is used for this domain, and if it is not
"smtp", post the concurrency limit for that transport.

> >>> Have you configured any concurrency controls or rate delay for this
> >>> destination?
>
> No. keep default unchanged. Which parameters for concurrency controls or
> rate delay need to be checked?

    All <transport>_destination_concurrency_limit settings in main.cf
    Any <transport>_destination_rate_delay settings in main.cf

> >>>This delay means a large number of messages waiting behind the messages
> >>>currently being delivered, subject to concurrency and rate delays.
>
> How can we increase delivery rate so that b-delay is down?

Either increase concurrency or reduce latency. Network captures
may show which protocol stage is responsible for most of the delay.
Even with TLS, one can tell whether the delay is at the beginning
or at the end of the TLS session, or whether it is just low
bandwidth throughout.

> How can we check the new server has a working local DNS cache?
> Check the file /etc/resolv.conf?

Yes, but also time MX, A and AAAA lookups for the destination relay.
How is the relay specified, with or without surrounding "[]"?

> In peak hours 10:30:00~ 10:59:59, other servers running Postfix-2.6.6
> on RHEL 6.4 were fine.

That's meaningless. What was their output rate? What was their
input rate? What was the typical "c+d" latency? If you want help
with performance problems, you need to start gathering and crunching
data. Being lazy and avoiding hard numbers is not an option.

> Only this server running Postfix-2.6.6 on RHEL 5.10 experienced serious
> delay.

Delay happens when the input rate exceeds the output rate.

> Do we need change some parameters to increase delivery rate or
> set special channel/allocate fixed SMTP processes for specified
> outbound domains?

Random parameter twiddling rarely solves congestion, but it can cause it.
Before changing anything, the reason for the congestion needs to be
identified. The output rate looks anaemic to me. Why is the output
concurrency so low?

> Our outbound emails have four main destination domains to be
> relayed to Windows FOPE.
>
> Buckeyemail.osu.edu ---------------> mail.us.messaging.microsoft.com
> Gmail.com ---------------------------> mail.us.messaging.microsoft.com
> Yahoo.com ----------------------------> mail.us.messaging.microsoft.com
> Hotmail.com ---------------------------> mail.us.messaging.microsoft.com

Did you measure the output rate for all mail destined to this relay,
or just for the first domain? The correct measurement is to aggregate
counts by transport next-hop. Please report output rates for all of
these combined, or rather for all mail with a relay of
"mail.us.messaging.microsoft.com".

--
    Viktor.