Viktor,

>> You need to post "postconf -n" output or at least:
>>
>>     default_destination_concurrency_limit
>>     smtp_destination_concurrency_limit
>>
>> check which transport is used for this domain, and if not "smtp", post the
>> concurrency limit for that.

Here is the output of 'postconf -n':

alias_database = dbm:/etc/aliases
alias_maps = hash:/etc/aliases
command_directory = /usr/sbin
config_directory = /etc/postfix
daemon_directory = /usr/libexec/postfix
data_directory = /var/lib/postfix
debug_peer_level = 2
header_checks = regexp:/etc/postfix/header_checks
html_directory = no
inet_interfaces = $myhostname, localhost
inet_protocols = ipv4
mail_owner = postfix
mailbox_size_limit = 53000000
mailq_path = /usr/bin/mailq.postfix
manpage_directory = /usr/share/man
message_size_limit = 52428800
mydestination = $myhostname, localhost.$mydomain, localhost
mydomain = osuad.osu.edu
myhostname = cio-krc-pf03.osuad.osu.edu
newaliases_path = /usr/bin/newaliases.postfix
queue_directory = /var/spool/postfix
readme_directory = /usr/share/doc/postfix-2.6.14-documentation/readme
relayhost = mail.us.messaging.microsoft.com
sample_directory = /usr/share/doc/postfix-2.6.14-documentation/samples
sendmail_path = /usr/sbin/sendmail.postfix
setgid_group = postdrop
smtp_tls_CAfile = /etc/postfix/service_certs/osu_ues/DigiCertCA.crt
smtp_tls_loglevel = 2
smtp_tls_note_starttls_offer = yes
smtp_tls_security_level = encrypt
smtp_tls_session_cache_database = btree:/var/lib/postfix/smtp_scache
smtpd_tls_CAfile = /etc/postfix/service_certs/osu_ues/DigiCertCA.crt
smtpd_tls_CApath = /etc/postfix/service_certs/osu_ues
smtpd_tls_cert_file = /etc/postfix/service_certs/osu_ues/OSU_UES_WC_Cert_Ex_certificate.pem
smtpd_tls_key_file = /etc/postfix/service_certs/osu_ues/OSU_UES_WC_Cert_Ex_key.pem
smtpd_tls_loglevel = 2
smtpd_tls_received_header = yes
smtpd_tls_security_level = may
smtpd_tls_session_cache_database = btree:/var/lib/postfix/smtpd_scache
unknown_local_recipient_reject_code = 550

Here are the settings for the following two parameters:

default_destination_concurrency_limit = 20
smtp_destination_concurrency_limit = $default_destination_concurrency_limit

No transport is used so far.

> >>> Have you configured any concurrency controls or rate delay for this
> >>> destination?
>
> No. keep default unchanged. Which parameters for concurrency controls or
> rate delay need to be checked?
>
> All <transport>_destination_concurrency_limit settings in main.cf
>
> Any <transport>_destination_rate_delay settings in main.cf

No per-transport parameter is defined in main.cf.

>> Either increase concurrency or reduce latency. Network captures may show
>> which protocol stage is responsible for most of the delay, even with TLS one
>> can tell whether the delay is at the beginning or at the end of the TLS
>> session or just low bandwidth throughout.

We prefer to increase concurrency.

>>> How can we check the new server has a working local DNS cache?
>>> Check the file /etc/resolv.conf?
>>
>> Yes, but also time MX, A and AAAA lookups for the destination relay.

# time dig MX mail.us.messaging.microsoft.com
real    0m0.014s
user    0m0.002s
sys     0m0.003s

# time dig A mail.us.messaging.microsoft.com
real    0m0.016s
user    0m0.004s
sys     0m0.004s

# time dig AAAA mail.us.messaging.microsoft.com
real    0m0.058s
user    0m0.001s
sys     0m0.004s

>> How is the relay specified with or without surrounding "[]"?

Without surrounding "[]".

relayhost = mail.us.messaging.microsoft.com

>> In peak hours 10:30:00~ 10:59:59, other servers running Postfix-2.6.6
>> on RHEL 6.4 were fine.
>
> That's meaningless, what was their output rate? What was their input rate?
> What was the typical "c+d" latency. If you want help with performance
> problems you need to start gathering and crunching data, being lazy and
> avoiding hard numbers is not an option.

You are totally correct.

On this RHEL 5.10 server, today 10:30:00 ~ 10:59:59, the output rate of email
to this domain in the 30 minutes was 10,928.

On the other 6 RHEL 6.4 servers, today 10:30:00 ~ 10:59:59, the output rate of
email to this domain in the 30 minutes ranged from 4,824 to 6,564.
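
As a rough cross-check of these numbers, here is a minimal sketch of the rate/latency/concurrency relationship (Little's law). It assumes the 2.8 second "c+d" latency seen in the single log entry quoted further down is typical, which has not been measured yet; the function names are hypothetical and only for illustration.

```python
def max_output_rate(concurrency_limit, latency_s):
    """Upper bound on deliveries per second when every delivery slot
    completes one delivery every latency_s seconds."""
    return concurrency_limit / latency_s


def busy_slots(deliveries, window_s, latency_s):
    """Average number of simultaneous deliveries implied by an observed
    delivery count over a window (Little's law: rate * latency)."""
    return deliveries / window_s * latency_s


# Assumed inputs: limit of 20 (from postconf), 2.8 s latency (one log entry),
# 10,928 deliveries in the 30-minute window 10:30:00 ~ 10:59:59.
ceiling_per_30_min = max_output_rate(20, 2.8) * 1800
slots_in_use = busy_slots(10928, 1800, 2.8)

print(f"ceiling at limit 20: {ceiling_per_30_min:.0f} deliveries per 30 minutes")
print(f"implied busy slots:  {slots_in_use:.0f} of 20")
```

Under that assumed latency the ceiling is roughly 12,900 deliveries per 30 minutes, and 10,928 deliveries would keep about 17 of the 20 slots busy. If the measured average "c+d" turns out to be different, the figures change in proportion.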

>> Our outbound emails have four main destination domains to be relayed
>> to Windows FOPE.
>>
>> Buckeyemail.osu.edu ---------------> mail.us.messaging.microsoft.com
>> Gmail.com ---------------------------> mail.us.messaging.microsoft.com
>> Yahoo.com ----------------------------> mail.us.messaging.microsoft.com
>> Hotmail.com ---------------------------> mail.us.messaging.microsoft.com
>
> Did you measure the output rate for all mail destined to this relay, or just
> the first domain? The correct measurement is to aggregate counts by transport
> next-hop. Please report output rates for all these combined, or rather all
> mail with a relay of "mail.us.messaging.microsoft.com".

We measured the output rate for all mail destined to this relay. I used your
criteria to double-check and got the following output rates. The data I gave
you in the previous email was not accurate.

Today 10:00:00 ~ 10:29:59 the output rate of email to this relay in the 30
minutes was 9,623.

Today 10:30:00 ~ 10:59:59 the output rate of email to this relay in the 30
minutes was 10,928:

    to domain "buckeyemail.osu.edu" via this relay:  6,399
    to domain "gmail.com" via this relay:            2,803
    to domain "yahoo.com" via this relay:              619
    to domain "Hotmail.com" via this relay:            336
    to other domains via this relay:                   771

Today 11:00:00 ~ 11:29:59 the output rate of email to this relay in the 30
minutes was 15,597.

Thanks,
Carl

-----Original Message-----
From: owner-postfix-us...@postfix.org [mailto:owner-postfix-us...@postfix.org] On Behalf Of Viktor Dukhovni
Sent: Monday, April 28, 2014 2:42 PM
To: postfix-users@postfix.org
Subject: Re: Backlog to outsourced email provider

On Mon, Apr 28, 2014 at 06:09:43PM +0000, Xie, Wei wrote:

> When congestion occurred, all other messages to the same domain were
> similarly delayed.
> The delay are longer and longer (the longest
> exceeded 1200 seconds) and the length of active queue is longer and
> longer (get to know from the outputs from commands 'qshape active' and
> 'mailq |grep \* |wc -l', the queued messages were over 9,000).

Clearly the output rate is not keeping up with the input rate.

> >>> What is the complete set of logs for this queue-id?
>
> Apr 28 10:47:11 cio-krc-pf03 postfix/smtpd[31853]: 9934181190:
>     client=cio-tnc-ht06.osuad.osu.edu[164.107.81.171]
> Apr 28 10:47:11 cio-krc-pf03 postfix/qmgr[31812]: 9934181190:
>     from=<erequest.do.not.re...@osu.edu>, size=1905, nrcpt=1 (queue active)
> Apr 28 11:03:18 cio-krc-pf03 postfix/smtp[5015]: 9934181190:
>     to=<turek...@buckeyemail.osu.edu>,
>     relay=mail.us.messaging.microsoft.com[216.32.181.178]:25, delay=967,
>     delays=0/964/1.5/1.3, dsn=2.6.0, status=sent (250 2.6.0
>     <27520027.166481398696426625.javamail.erequest.do.not.re...@osu.edu>
>     [InternalId=9787221] Queued mail for delivery)
> Apr 28 11:03:18 cio-krc-pf03 postfix/qmgr[31812]: 9934181190: removed

Yes, indeed nothing seems to happen for the 964 seconds the message sits in
the queue.

> >>> What was the output rate of email to this domain in the 30 minutes
> >>> preceding this log entry? (Avoid counting multiple recipients
> >>> with the same queue-id, relay and remote server response as
> >>> separate deliveries).
>
> Today 10:00:00 ~ 10:29:59 the output rate of email to this domain in the 30
> minutes was 15,361.
> Today 10:30:00 ~ 10:59:59 the output rate of email to this domain in the 30
> minutes was 28,827.
> Today 11:00:00 ~ 11:29:59 the output rate of email to this domain in the 30
> minutes was 111,27.

Can you clarify that last one? Is that ~11 thousand? The comma seems
misplaced.

The earlier rate appears to be ~100 messages per minute, or just over 1 per
second. You should measure the average "c+d" in the log for these time
frames, again counting multiple recipients in a single delivery as one event.
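
The measurement described above (one event per delivery, aggregated by relay, with the average "c+d") could be sketched roughly as follows. This is a minimal, untested-against-real-logs sketch: the regular expression assumes the smtp(8) log line layout seen in the log entry quoted above, and the names DELIVERY and summarize are hypothetical.

```python
import re
from collections import defaultdict

# Fields of a "status=sent" line logged by the Postfix smtp client.
# Assumption: the line layout matches the entry quoted in this thread.
DELIVERY = re.compile(
    r"postfix/smtp\[\d+\]: (?P<qid>[0-9A-F]+): to=<[^>]*>, "
    r"(?:orig_to=<[^>]*>, )?"
    r"relay=(?P<relay>[^\[,]+)\S*, .*?"
    r"delays=(?P<a>[\d.]+)/(?P<b>[\d.]+)/(?P<c>[\d.]+)/(?P<d>[\d.]+), "
    r".*?status=sent \((?P<resp>.*)\)"
)


def summarize(lines):
    """Return {relay: (deliveries, average c+d in seconds)}.

    Recipients sharing the same queue-id, relay and remote response are
    counted as a single delivery.
    """
    seen = set()
    stats = defaultdict(lambda: [0, 0.0])  # relay -> [count, total c+d]
    for line in lines:
        m = DELIVERY.search(line)
        if not m:
            continue  # not a successful smtp delivery line
        key = (m["qid"], m["relay"], m["resp"])
        if key in seen:
            continue  # another recipient of an already counted delivery
        seen.add(key)
        entry = stats[m["relay"]]
        entry[0] += 1
        entry[1] += float(m["c"]) + float(m["d"])
    return {relay: (n, total / n) for relay, (n, total) in stats.items()}
```

Pass it only the log lines from the window of interest (for example 10:30:00 through 10:59:59), with each log record on a single line, and read off the count and the average "c+d" for the relay.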

Supposing the 2.8 second delivery latency to be typical, a delivery rate of
1-2 messages per second suggests a destination concurrency limit of "2",
rather than the default limit of 20. You need to post "postconf -n" output
or at least:

    default_destination_concurrency_limit
    smtp_destination_concurrency_limit

check which transport is used for this domain, and if not "smtp", post the
concurrency limit for that.

> >>> Have you configured any concurrency controls or rate delay for this
> >>> destination?
>
> No. keep default unchanged. Which parameters for concurrency controls or
> rate delay need to be checked?

All <transport>_destination_concurrency_limit settings in main.cf

Any <transport>_destination_rate_delay settings in main.cf

> >>> This delay means a large number of messages waiting behind the messages
> >>> currently being delivered, subject to concurrency and rate delays.
>
> How can we increase delivery rate so that b-delay is down?

Either increase concurrency or reduce latency. Network captures may show
which protocol stage is responsible for most of the delay, even with TLS one
can tell whether the delay is at the beginning or at the end of the TLS
session or just low bandwidth throughout.

> How can we check the new server has a working local DNS cache?
> Check the file /etc/resolv.conf?

Yes, but also time MX, A and AAAA lookups for the destination relay.
How is the relay specified with or without surrounding "[]"?

> In peak hours 10:30:00~ 10:59:59, other servers running Postfix-2.6.6
> on RHEL 6.4 were fine.

That's meaningless, what was their output rate? What was their input rate?
What was the typical "c+d" latency. If you want help with performance
problems you need to start gathering and crunching data, being lazy and
avoiding hard numbers is not an option.

> Only this server running Postfix-2.6.6 on RHEL 5.10 experienced
> serious delay.

Delay happens when the input rate exceeds the output rate.

> Do we need change some parameters to increase delivery rate or set
> special channel/allocate fixed SMTP processes for specified outbound
> domains?

Random parameter twiddling rarely solves congestion, but it can cause it.
Before changing anything the reason for the congestion needs to be
identified. The output rate looks anaemic to me, why is the output
concurrency so low?

> Our outbound emails have four main destination domains to be relayed
> to Windows FOPE.
>
> Buckeyemail.osu.edu ---------------> mail.us.messaging.microsoft.com
> Gmail.com ---------------------------> mail.us.messaging.microsoft.com
> Yahoo.com ----------------------------> mail.us.messaging.microsoft.com
> Hotmail.com ---------------------------> mail.us.messaging.microsoft.com

Did you measure the output rate for all mail destined to this relay, or just
the first domain? The correct measurement is to aggregate counts by transport
next-hop. Please report output rates for all these combined, or rather all
mail with a relay of "mail.us.messaging.microsoft.com".

-- 
Viktor.