Re: Backlog to outsourced email provider

Viktor Dukhovni Wed, 30 Apr 2014 22:12:07 -0700

On Thu, May 01, 2014 at 04:25:01AM +0000, Xie, Wei wrote:

> Monday night I fixed my logging as below.
> 
> > smtp_tls_loglevel = 1
> > smtpd_tls_loglevel = 1


Good.

> Here are throughput per 30 minutes from 8:00 through 18:00 yesterday and today

When your queue is not congested (queue backlog is negligible),
the output rate is simply equal to the input rate and does not mean
much other than that you're sending below the peak output capacity
(which is a good thing).

So while measuring output rates, you also need to determine
whether there is in fact a backlog.  Alongside all these numbers
you need to track:

    - Exponentially smoothed moving average "c+d" values.
    - Exponentially smoothed moving average "b" values.

The "c+d" values (abnormally high delivery latency) will measure
potential remote causes of congestion, while the "b" values will
measure the resulting delays.

The exponential smoothing avoids undue contribution from single
message spikes and quickly forgets stale history.  For each new
delivery (again avoid double-counting multiple recipients in a
single message) apply something like the Perl snippet below:

    $alpha = 0.05;  # If you want less noisy data at the cost of not seeing
                    # some shorter-term spikes, reduce $alpha to ~0.02.
    $b_moving = (1-$alpha) * $b_moving + $alpha * $b;
    $cd_moving = (1-$alpha) * $cd_moving + $alpha * $cd;

Then print "$b_moving" and "$cd_moving" every 100 or so deliveries.
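
For anyone who would rather not do this in Perl, here is a rough Python
sketch of the same smoothing applied to delivery log lines.  It assumes
the usual "delays=a/b/c/d" field in the smtp(8) delivery records; the
function name, the regular expression and the de-duplication key are
illustrative choices of this sketch, not anything Postfix provides.
Like the Perl variables, the averages start at zero.

```python
import re

# Queue ID and the four "delays=a/b/c/d" components of a delivery record.
# Assumes the default Postfix log format; adjust if your syslog differs.
DELIVERY = re.compile(
    r"\]: (\w+): to=<.*?delays=([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+)"
)

def smoothed_delays(lines, alpha=0.05, report_every=100):
    """Return (count, b_moving, cd_moving) tuples, one per report_every deliveries."""
    b_moving = cd_moving = 0.0
    seen = set()      # (queue ID, delays) pairs already counted
    count = 0
    reports = []
    for line in lines:
        m = DELIVERY.search(line)
        if not m:
            continue
        key = m.groups()
        if key in seen:
            # Another recipient of the same delivery: do not double-count.
            continue
        seen.add(key)
        _qid, _a, b, c, d = key
        b_value = float(b)
        cd_value = float(c) + float(d)
        b_moving = (1 - alpha) * b_moving + alpha * b_value
        cd_moving = (1 - alpha) * cd_moving + alpha * cd_value
        count += 1
        if count % report_every == 0:
            reports.append((count, b_moving, cd_moving))
    return reports
```

Feed it the lines of your mail log, e.g. smoothed_delays(open(path)),
where path is wherever your syslog writes mail records.  The "seen" set
grows with the log, which is fine for a day of data but would want
pruning in a long-running monitor.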

> >>>Understand and memorize this simple formula:
> >>>
> >>>   Throughput = Concurrency / Latency
> 
> If Latency = 20,

Latency has units of time: it is how long it takes to deliver a
single message.  So the above makes no sense.

> Concurrency=2.8s,

Concurrency is dimensionless: it counts the number of simultaneous
deliveries.  This makes no sense.

You have to *measure* the latency (smoothed "c+d"), not guess from
a single message.  That was just a crude estimate based on the drop
of water provided to estimate the number of fish in the ocean.

>  Is this a threshold?

The output rate cannot exceed the peak concurrency divided by the
average latency.  This is only a problem if the input rate is higher
still.  For email, the solution is to first work to eliminate anomalous
latency, and then if possible increase concurrency.

    "Money can buy bandwidth, but latency is forever".
        -- John Mashey, MIPS

The latency for email delivery between well functioning systems is
often two orders of magnitude smaller than the latency when something
is wrong.  So it makes sense to first control the latency.  But
physics imposes tight lower bounds, at which point, if more throughput
is required, you need more concurrency, and email delivery is highly
parallelizable.
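
To see why latency comes first, here is a small Python sketch of the
Throughput = Concurrency / Latency bound.  The function names are made
up for this example, and the latencies in the usage note are
hypothetical round numbers, not measurements from anyone's logs.

```python
def max_output_rate(concurrency, latency_seconds):
    """Upper bound on deliveries per second: concurrency / latency."""
    if latency_seconds <= 0:
        raise ValueError("latency must be a positive number of seconds")
    return concurrency / latency_seconds

def max_per_interval(concurrency, latency_seconds, interval_seconds=1800):
    """The same bound expressed per reporting interval (default 30 minutes)."""
    return max_output_rate(concurrency, latency_seconds) * interval_seconds
```

With 20 concurrent deliveries, a hypothetical smoothed "c+d" of 5s caps
output at 4 messages per second, while 0.05s (two orders of magnitude
lower) allows about 400.  Raising concurrency from 20 to 30 at the same
5s latency only gets you from 4 to 6.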

> If real throughput is approximately greater than this number,

        s/throughput/input/

> delays will obviously occurred in peak hours as below on Monday, right? 

Mail piles up when it arrives faster than it leaves.
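
As a toy illustration (Python, with invented per-interval numbers), the
backlog at the end of each interval is whatever was already queued,
plus what arrived, minus what the output side could deliver, and it
cannot go below zero.  The function name and figures are assumptions of
this sketch only.

```python
def backlog_over_time(arrivals, capacity):
    """Queue backlog after each interval, given arrivals per interval
    and a fixed output capacity per interval."""
    backlog = 0
    history = []
    for arrived in arrivals:
        # Whatever cannot be delivered in this interval carries over.
        backlog = max(0, backlog + arrived - capacity)
        history.append(backlog)
    return history
```

For example, backlog_over_time([10000, 15000, 14000, 8000, 5000], 12000)
gives [0, 3000, 5000, 1000, 0]: the queue grows only while input
exceeds capacity, and drains once input drops below it.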

> If we increase default_destination_concurrency_limit = 30, the
> threshold of throughput per 30 minutes will be 19,285.71 ( 30/2.8
> * 1800=19,285.71), which is greater than throughput yesterday and
> today. Does this avoid obvious delays?

The 2.8 was pulled out of a hat.  You really should have a strong
impression by now that I have little patience for lazy guesswork.
Don't guess, measure!  The 2.8s number is, I think, way too high;
surely the provider can do better.

Perhaps your DNS is configured poorly and their lookups are slowing
down deliveries?  Or you're hitting a congested shared system that
the provider needs to make more performant.  Are you paying them
enough money to get good service?  Can someone else deliver good
service for a similar cost?

> Also, I read Postfix Performance Tuning at URL
> http://www.postfix.org/TUNING_README.html about "Tuning the number
> of simultaneous deliveries" and " Tuning the number of recipients
> per delivery".

This does not apply with "transactional" email where each message
has just one recipient.  For mail to large lists, the Postfix
default of 50 recipients per message is about right in most cases.
When virus scanning messages to large lists (content filter
transports), I used to set the recipient limit to ~1000, but both
ends of the SMTP connection were configured by me.  Remote systems
may not support much more (and sometimes, unfortunately, support
less) than the RFC requirement of at least 100 recipients per message.
 
> * For high volume destination, it seems we are able to increase
> default_destination_concurrency (20->30?)

*After* figuring out what the latency is, why it is, and what if
anything can be done about it.

> and lower smtp_connection_timeout (30s ->5s);

Fine on your own network, unlikely to make any difference with large
providers that use load-balancers, which almost never exhibit any
connection latency.

>     * For high volume destination, it seems we are able to increase
> default_destination_recipient_limit (50 ->100?)

Won't make any difference if each message has just one recipient.
What is the distribution of message recipient counts in your logs?
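
One way to answer that, sketched in Python: tally the "nrcpt=" value
that qmgr(8) logs when a message becomes active.  This assumes the
default log format ("... qmgr[pid]: QUEUEID: from=<...>, size=...,
nrcpt=N (queue active)"); the function name and regular expression are
illustrative only.

```python
import re
from collections import Counter

# qmgr record logged when a message enters the active queue.
NRCPT = re.compile(
    r"/qmgr\[\d+\]: (\w+): from=<[^>]*>, size=\d+, nrcpt=(\d+)"
)

def recipient_count_distribution(lines):
    """Map recipient count -> number of messages with that many recipients."""
    per_message = {}
    for line in lines:
        m = NRCPT.search(line)
        if m:
            # A deferred message is logged again on each retry, so key by
            # queue ID and keep one entry per message.
            per_message[m.group(1)] = int(m.group(2))
    return Counter(per_message.values())
```

If nearly every message lands in the "1" bucket, the recipient limit is
irrelevant to your throughput.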

> For high volume sites a key tuning parameter is the number of
> "smtp" delivery agents allocated to the "smtp" and "relay" transports.
> High volume sites tend to send to many different destinations, many
> of which may be down or slow, so a good fraction of the available
> delivery agents will be blocked waiting for slow sites. Also mail
> destined across the globe will incur large SMTP command-response
> latencies, so high message throughput can only be achieved with
> more concurrent delivery agents. )

All your mail goes to a single relay host.  The above is about high
volume sending systems that send "direct to MX".

> and " Example 4: High volume destination backlog", including the
> following paragraph:

No need to quote QSHAPE_README at me, I wrote it. :-)

>     In master.cf set up a dedicated clone of the "smtp" transport
>     for the destination in question. In the example below we will call
>     it "fragile".

Your destination is not "fragile".  That's only needed for destinations
that get throttled due to repeated timeouts, connection failures
or the destination refusing service under load.  Your destination
is "slow", not "fragile".

> Can we divide destination domains into three transport groups
> and create two clones of  the "smtp" transport?

All your mail goes to a single relay host, there's nothing to divide.

-- 
        Viktor.
