Viktor,

>>Understand and memorize this simple formula:
>>
>>      Throughput = Concurrency / Latency

>If Latency = 20, Concurrency=2.8s, Throughput=7.14286/second, which is equal 
>to 12,857/30minutes. Is this a threshold? If real throughput is approximately 
>greater than this number, delays will obviously occur in peak hours as 
>below on Monday, right?

It was a typo because I was tired last midnight.  It should be:

If Latency = 2.8s, Concurrency = 20, then Throughput = 7.14286/second, which 
equals 12,857 per 30 minutes. Is this a threshold? If real throughput is 
approximately greater than this number, delays will obviously occur in peak 
hours as below on Monday, right?
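To double-check my arithmetic, a small Python sketch (the 1800 is simply the 
number of seconds in a 30-minute window):

```python
# Throughput = Concurrency / Latency
concurrency = 20   # simultaneous deliveries (dimensionless)
latency = 2.8      # seconds per delivery

throughput = concurrency / latency   # deliveries per second
per_30_min = throughput * 1800       # 30 minutes = 1800 seconds

print(round(throughput, 5))   # → 7.14286
print(round(per_30_min))      # → 12857
```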

>Therefore, while measuring output rates, you also need to determine whether 
>there is indeed a backlog.  Associated with all these numbers you need to 
>track:
>
>    - Exponentially smoothed moving average "c+d" values.
>    - Exponentially smoothed moving average "b" values.
>
>The "c+d" values (abnormally high delivery latency) will measure potential 
>remote causes of congestion, while the "b" values will measure the resulting 
>delays.

Actually, I am doing this. Once I complete it, I will post the results.

>The exponential smoothing avoids undue contribution from single message spikes 
>and quickly forgets stale history.  For each new delivery (again avoid 
>double-counting multiple recipients in a single message) apply something like 
>the Perl snippet below:
>
>    $alpha = 0.05;  # If you want less noisy data at the cost of not seeing
>                    # some shorter-term spikes, reduce $alpha to ~0.02.
>    $b_moving = (1-$alpha) * $b_moving + $alpha * $b;
>    $cd_moving = (1-$alpha) * $cd_moving + $alpha * $cd;
>
>Then print "$b_moving" and "$cd_moving" every 100 or so deliveries.

The initial values of $b_moving and $cd_moving should be zero, right?

>> If we increase default_destination_concurrency_limit = 30, the 
>> threshold of throughput per 30 minutes will be 19,285.71 ( 30/2.8
>> * 1800=19,285.71), which is greater than throughput yesterday and 
>> today. Does this avoid obvious delays?
>
>The 2.8 was pulled out of a hat; you really should have a strong impression by 
>now that I have little patience for lazy guess-work.
>Don't guess, measure!  The 2.8s number is, I think, way too high; surely the 
>provider can do better.

The 2.8s figure was first given in your email below, sent Monday at 7:46pm. I 
just used it in a calculation to ask my question. These past few days I have 
been working until midnight, writing small scripts that scan the logs and do 
the measurement work.

==============================================================
> Today 10:00:00 ~ 10:29:59 the output rate of email to this relay in 
> the 30 minutes was 9,623.
> 
> Today 10:30:00 ~ 10:59:59 the output rate of email to this relay in 
> the 30 minutes was 10,928.

That's more like it: Throughput * Latency = Concurrency

    10928 / 1800 * 2.8 = 17.0

So with latencies around 2.8 seconds your estimated concurrency is
~17, which is close enough to 20.  The problem is either that your syslogd is 
overwhelmed and too slow or the vendor service is too slow.
Fix the first problem first.


> Today 11:00:00 ~ 11:29:59 the output rate of email to this relay in 
> the 30 minutes was 15,597.

    15597 / 1800 * 2.8 = 24.3
===========================================
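For my own checking, the same estimate (Little's law rearranged) as a small 
Python helper, recomputed to one decimal place:

```python
def estimated_concurrency(deliveries_per_30min, latency_s):
    """Concurrency = Throughput * Latency (Little's law rearranged)."""
    throughput = deliveries_per_30min / 1800.0   # deliveries per second
    return throughput * latency_s

print(round(estimated_concurrency(10928, 2.8), 1))  # → 17.0
print(round(estimated_concurrency(15597, 2.8), 1))  # → 24.3
```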

Thanks a lot!!!

Carl

-----Original Message-----
From: owner-postfix-us...@postfix.org [mailto:owner-postfix-us...@postfix.org] 
On Behalf Of Viktor Dukhovni
Sent: Thursday, May 01, 2014 1:11 AM
To: postfix-users@postfix.org
Subject: Re: Backlog to outsourced email provider

On Thu, May 01, 2014 at 04:25:01AM +0000, Xie, Wei wrote:

> Monday night I fixed my logging as below.
> 
> > smtp_tls_loglevel = 1
> > smtpd_tls_loglevel = 1

Good.

> Here are throughput per 30 minutes from 8:00 through 18:00 yesterday 
> and today

When your queue is not congested (queue backlog is negligible), the output rate 
is simply equal to the input rate and does not mean much other than that you're 
sending below the peak output capacity (which is a good thing).

Therefore, while measuring output rates, you also need to determine whether 
there is indeed a backlog.  Associated with all these numbers you need to 
track:

    - Exponentially smoothed moving average "c+d" values.
    - Exponentially smoothed moving average "b" values.

The "c+d" values (abnormally high delivery latency) will measure potential 
remote causes of congestion, while the "b" values will measure the resulting 
delays.

The exponential smoothing avoids undue contribution from single message spikes 
and quickly forgets stale history.  For each new delivery (again avoid 
double-counting multiple recipients in a single message) apply something like 
the Perl snippet below:

    $alpha = 0.05;  # If you want less noisy data at the cost of not seeing
                    # some shorter-term spikes, reduce $alpha to ~0.02.
    $b_moving = (1-$alpha) * $b_moving + $alpha * $b;
    $cd_moving = (1-$alpha) * $cd_moving + $alpha * $cd;

Then print "$b_moving" and "$cd_moving" every 100 or so deliveries.
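The same smoothing loop, self-contained in Python for illustration.  The 
moving averages start at zero, and the per-delivery (b, c+d) samples below are 
invented:

```python
ALPHA = 0.05  # smoothing factor; reduce to ~0.02 for less noisy data

def smooth(prev, sample, alpha=ALPHA):
    """One step of an exponentially weighted moving average."""
    return (1 - alpha) * prev + alpha * sample

b_moving = 0.0   # smoothed queue-wait ("b"), initialized to zero
cd_moving = 0.0  # smoothed delivery latency ("c+d"), initialized to zero

# Invented (b, c+d) pairs in seconds, one per delivery; the single
# 30-second spike contributes little and is quickly forgotten.
deliveries = [(0.1, 2.5), (0.2, 2.9), (5.0, 30.0), (0.1, 2.7)] * 50

for n, (b, cd) in enumerate(deliveries, 1):
    b_moving = smooth(b_moving, b)
    cd_moving = smooth(cd_moving, cd)
    if n % 100 == 0:  # report every 100 deliveries
        print(f"{n}: b_moving={b_moving:.2f} cd_moving={cd_moving:.2f}")
```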

> >>>Understand and memorize this simple formula:
> >>>
> >>>   Throughput = Concurrency / Latency
> 
> If Latency = 20,

Latency has units of time; it is how long it takes to deliver a single message, 
so the above makes no sense.

> Concurrency=2.8s,

Concurrency is dimensionless; it counts the number of simultaneous deliveries.
This makes no sense.

You have to *measure* the latency (smoothed "c+d"), not guess from a single 
message.  That was just a crude estimate based on the drop of water provided to 
estimate the number of fish in the ocean.

>  Is this a threshold?

The output rate cannot exceed the peak concurrency divided by the average 
latency.  This is only a problem if the input rate is higher still.  For email, 
the solution is to first work to eliminate anomalous latency, and then if 
possible increase concurrency.

    "Money can buy bandwidth, but latency is forever".
        -- John Mashey, MIPS

The latency for email delivery between well functioning systems is often two 
orders of magnitude smaller than the latency when something is wrong.  So it 
makes sense to first control the latency, but physics imposes tight lower 
bounds, at which point if more throughput is required, you need more 
concurrency, and email delivery is highly parallelizable.

> If real throughput is approximately greater than this number,

        s/throughput/input/

> delays will obviously occur in peak hours as below on Monday, right? 

Mail piles up when it arrives faster than it leaves.
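A toy illustration of that (the rates are invented): the backlog grows by the 
input/output difference per unit time, and cannot go below zero.

```python
def backlog_after(minutes, input_rate, output_rate, start=0):
    """Backlog changes by (input - output) each minute, never below zero."""
    backlog = start
    for _ in range(minutes):
        backlog = max(0, backlog + input_rate - output_rate)
    return backlog

# Invented rates, messages per minute.
print(backlog_after(30, input_rate=600, output_rate=520))  # → 2400
print(backlog_after(30, input_rate=400, output_rate=520))  # → 0
```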

> If we increase default_destination_concurrency_limit = 30, the 
> threshold of throughput per 30 minutes will be 19,285.71 ( 30/2.8
> * 1800=19,285.71), which is greater than throughput yesterday and 
> today. Does this avoid obvious delays?

The 2.8 was pulled out of a hat; you really should have a strong impression by 
now that I have little patience for lazy guess-work.
Don't guess, measure!  The 2.8s number is, I think, way too high; surely the 
provider can do better.

Perhaps your DNS is configured poorly and their lookups are slowing down 
deliveries?  Or you're hitting a congested shared system that the provider 
needs to make more performant.  Are you paying them enough money to get good 
service?  Can someone else deliver good service for a similar cost?

> Also, I read Postfix Performance Tuning at URL 
> http://www.postfix.org/TUNING_README.html about "Tuning the number of 
> simultaneous deliveries" and " Tuning the number of recipients per 
> delivery".

This does not apply with "transactional" email where each message has just one 
recipient.  For mail to large lists, the Postfix default of 50 recipients per 
message is about right in most cases.
When virus scanning messages to large lists (content filter transports), I used 
to set the recipient limit to ~1000, but both ends of the SMTP connection were 
configured by me.  Remote systems may not support much more (and sometimes 
unfortunately support less) than the RFC requirement of at least 100 recipients 
per message.
 
> * For high volume destination, it seems we are able to increase 
> default_destination_concurrency (20->30?)

*After* figuring out what the latency is, why it is, and what if anything can 
be done about it.

> and lower smtp_connection_timeout (30s ->5s);

Fine on your own network, unlikely to make any difference with large providers 
that use load-balancers, which almost never exhibit any connection latency.

>     * For high volume destination, it seems we are able to increase 
> default_destination_recipient_limit (50 ->100?)

Won't make any difference if each message has just one recipient.
What is the distribution of message recipient counts in your logs?
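One way to get that distribution (a sketch, keying off the nrcpt= field that 
the Postfix queue manager logs for each queued message; the sample log lines 
below are invented):

```python
import re
from collections import Counter

NRCPT_RE = re.compile(r"\bnrcpt=(\d+)")

def recipient_distribution(lines):
    """Count messages by recipient count using the nrcpt= field in qmgr logs."""
    counts = Counter()
    for line in lines:
        m = NRCPT_RE.search(line)
        if m:
            counts[int(m.group(1))] += 1
    return counts

# Invented sample log lines:
sample = [
    "May  1 10:00:01 mx postfix/qmgr[123]: AAAAAA: from=<a@example.com>, size=1024, nrcpt=1 (queue active)",
    "May  1 10:00:02 mx postfix/qmgr[123]: AAAAAB: from=<b@example.com>, size=2048, nrcpt=1 (queue active)",
    "May  1 10:00:03 mx postfix/qmgr[123]: AAAAAC: from=<c@example.com>, size=4096, nrcpt=50 (queue active)",
]
print(recipient_distribution(sample))  # → Counter({1: 2, 50: 1})
```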

> For high volume sites a key tuning parameter is the number of "smtp" 
> delivery agents allocated to the "smtp" and "relay" transports.
> High volume sites tend to send to many different destinations, many of 
> which may be down or slow, so a good fraction of the available 
> delivery agents will be blocked waiting for slow sites. Also mail 
> destined across the globe will incur large SMTP command-response 
> latencies, so high message throughput can only be achieved with more 
> concurrent delivery agents. )

All your mail goes to a single relay host.  The above is about high volume 
sending systems that send "direct to MX".

> and " Example 4: High volume destination backlog", including the 
> following paragraph:

No need to quote QSHAPE_README at me, I wrote it. :-)

>     In master.cf set up a dedicated clone of the "smtp" transport
>     for the destination in question. In the example below we will call
>     it "fragile".

Your destination is not "fragile".  That's only needed for destinations that 
get throttled due to repeated timeouts, connection failures or the destination 
refusing service under load.  Your destination is "slow", not "fragile".

> Can we divide destination domains into three transport groups and 
> create two clones of the "smtp" transport?

All your mail goes to a single relay host, there's nothing to divide.

-- 
        Viktor.

