Piotr,

> We noticed a big increase in user CPU utilization on our MX servers since
> Sep 2nd sa-update.  On a typical day we process over 2 million emails on
> our mail cluster. Our debugging has so far isolated the problem to:

> 1) iXhash was a problem module, so we disabled it (the remote location
> it was hitting was causing a problem)

Ok.

> 2) The following RBLs spamhaus, mailspke, njabl, spameating monkey,
> were determined to be causing latency with dig@localhost queries,
> so they were removed

Turning off the Spamhaus RBL is a drastic measure and should not
be done lightly. If some RBLs are found to be responding slowly,
this should be investigated. A single non-responding RBL can
have a significant effect, as it can extend the DNS wait time
up to the rbl_timeout value. Its default is 15 seconds,
which I find way beyond reasonable.

See the man page for Mail::SpamAssassin::Conf. The rbl_timeout
setting can provide a global default, but can also be specified
per zone. Here is an example (not necessarily a recommendation):

  rbl_timeout 5.5 2.2
  rbl_timeout 7.7 3.3 open-whois.org
  rbl_timeout 4       surbl.org
  rbl_timeout 3.5     dwl.spamhaus.org

Also make sure the DNS setup is sound: use a local caching DNS
resolver (on the same host or on the same LAN), avoid remote or
heavily utilized DNS servers, and preferably dedicate a DNS
resolver to mail checking.
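
A quick way to check resolver latency per zone is to time a test
lookup with dig. The sketch below assumes dig is installed and that
a resolver listens on the address you pass in; the function names
are made up for illustration and the zones are only examples.
127.0.0.2 is the conventional DNSBL test entry, hence the
2.0.0.127 prefix.

```shell
# Extract the "Query time" figure (in msec) from dig output on stdin.
query_time_ms() {
  sed -n 's/^;; Query time: \([0-9][0-9]*\) msec.*$/\1/p'
}

# Time one test lookup per zone against the given resolver.
# Usage: rbl_latency 127.0.0.1 zen.spamhaus.org bl.spameatingmonkey.net
rbl_latency() {
  resolver=$1
  shift
  for zone in "$@"; do
    # One attempt, 5 second timeout; no "Query time" line means no answer
    ms=$(dig +tries=1 +time=5 "@$resolver" "2.0.0.127.$zone" A | query_time_ms)
    printf '%s\t%s\n' "$zone" "${ms:-no-answer}"
  done
}
```

Running it twice is worthwhile: the second run is usually answered
from the resolver's cache, so a large gap between the two runs
points at the upstream zone rather than the local resolver.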

> 3) dcc, pyzor and razor2 and freewebmail were disabled

The DCC is not supposed to be slow. Use a local dccifd daemon,
and set up a local server if necessary.

> 4) sa-compile was utilized to improve the systems being cpu bound

Yes, that can help by a few percent.

Similarly, re-compiling perl with a more recent version of gcc
can bring another few percent of speedup.

And using amavisd 2.8.0 instead of an earlier version can bring
another few percent, as it avoids some I/O for smaller messages
(which amount to about 90 % of all messages according to our statistics).

> We have tested by turning the following flags on (to disable the services):
> 
> $skip_rbl_checks= 1 (spamassassin)
> $sa_local_tests_only = 1 (amavis hook)

This will drastically reduce the quality of SpamAssassin
classification, and should not be done lightly, except for
short periods to avoid some crisis.

> With the above we have started to see an improvement but we are still
> trying to identify the root cause.

Observe the timing statistics, and check timing for a couple of
sample messages by running 'spamassassin -D -t' from a command line.
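
To compare several sample messages, a loop like the following may
help. It is a rough sketch: the function name and the directory of
samples are assumptions, and wall-clock time is measured in whole
seconds only, which is enough to spot messages that take many
seconds.

```shell
# Run each sample message through spamassassin in test mode and print
# the elapsed wall-clock seconds, slowest first.
# Usage: time_samples /path/to/samples   (directory name is an example)
time_samples() {
  for msg in "$1"/*; do
    [ -f "$msg" ] || continue
    start=$(date +%s)
    # -t: test mode; the report goes to stdout and is discarded here
    spamassassin -t < "$msg" > /dev/null 2>&1
    end=$(date +%s)
    printf '%s\t%s\n' "$((end - start))" "$msg"
  done | sort -rn
}
```

The slowest messages can then be re-run one at a time with -D
added, with stderr redirected to a file, to see the debug output.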

Amavisd provides timing reports (the ' TIMING ' log entries)
at log level 2 or above.

Amavisd will also include SpamAssassin's detailed timing report
(the ' TIMING-SA ' log entries) in its log at log level 2 or above.

The 'amavisd-nanny' (or its newer sister 'amavisd-status'
which uses a quicker 0MQ (ZeroMQ) message passing library
instead of a Berkeley DB) can also provide a valuable
first assessment on the status of all amavisd processes.

Make sure that most of the time is spent in SpamAssassin,
and if so, check its timing report to narrow down the search
for a culprit.

If RBL/DNS lookups in SpamAssassin are suspect, you can
enable the 'async' and 'timing' debug areas in SpamAssassin
by using option -d in amavisd (something similar is possible
in spamd), e.g.:
  # amavisd -d async,timing

which will enable SpamAssassin debug messages for the listed areas
in the amavisd log at log level 3 or above. Searching for
'SA dbg: async: timing:' in the log will yield a detailed
timing report for each RBL/DNS lookup as issued by SpamAssassin.
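
Once those entries are in the log, pulling them out is simple.
The sketch below relies only on the search string quoted above;
the log file path is an assumption and depends on your syslog
setup, and the function name is made up.

```shell
# Print the per-lookup timing entries that SpamAssassin logged via amavisd.
# Usage: async_timing /var/log/amavis.log   (path is an example)
async_timing() {
  # -F: match the string literally, not as a regular expression
  grep -F 'SA dbg: async: timing:' "$1"
}
```

Piping the result through 'wc -l' or 'tail' gives a first
impression of how many lookups are issued and what the most
recent ones looked like.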

Sometimes certain SpamAssassin rules can be much slower than
the rest. A less-than-carefully written regex in some local rule
can take a heavy toll. If you have some sample messages which
seem to be taking above-average processing time, one way to
assess how much time each rule takes is to enable the plugin
HitFreqsRuleTiming.pm, which can be found in the SpamAssassin
distribution in the directory ./masses/plugins/ . When this plugin
is enabled, at the end of each mail check it will produce
a file 'timing.log' in the current directory (which must be
writable). The file contains a sorted top-n list of the slowest
rules. This plugin is not suitable for production use, but can be
very useful when a 'spamassassin' command is run on a mail sample.

Last but not least, do not forget the basic amavisd / MTA tuning
knob: the number of child processes. If CPU utilization is close
to 100 % for extended periods, reducing the number of processes
can save memory and may actually increase throughput.
On the other hand, if mail processing is falling behind
and CPU utilization is not very high and memory is available,
increasing the number of processes will help considerably.

> Our current fix is really a patch since those custom changes will be
> removed during the next nightly sa-update run.  Are there recommend ways
> to create an exclude list to allow these types of local updates to be
> included in the sa-update workflow without having to hack it directly?

Rules and settings in files like local.cf override the default
settings and rules that come with sa-update, and sa-update does
not touch local.cf, so local changes placed there survive the
nightly run.
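
As a sketch of what that looks like in practice: a rule can be
switched off locally by giving it a score of 0. The rule name
below is a placeholder, not a real rule, the function name is
made up, and the local.cf path varies by installation (commonly
/etc/mail/spamassassin/local.cf).

```shell
# Append a local override that disables one rule by scoring it 0.
# Usage: disable_rule /etc/mail/spamassassin/local.cf SOME_RULE_NAME
# (SOME_RULE_NAME is a placeholder; substitute the rule you want off)
disable_rule() {
  cf=$1
  rule=$2
  # Skip if an identical override is already present
  grep -qx "score $rule 0" "$cf" 2>/dev/null && return 0
  printf 'score %s 0\n' "$rule" >> "$cf"
}
```

Afterwards, 'spamassassin --lint' checks the configuration for
syntax errors, and amavisd needs a reload to pick up the change.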

> Has anyone experienced anything similar since September 2nd update?

Not really, but perhaps I wasn't paying attention.

  Mark
