Piotr,

> We noticed a big increase in user CPU utilization on our MX servers since
> Sep 2nd sa-update. On a typical day we process over 2 million emails on
> our mail cluster. Our debugging has so far isolated the problem to:
> 1) iXhash was a problem module, so we disabled it (the remote location
> it was hitting was causing a problem)

Ok.

> 2) The following RBLs spamhaus, mailspke, njabl, spameating monkey,
> were determined to be causing latency with dig@localhost queries,
> so they were removed

Turning off the Spamhaus RBL is a drastic measure and should not be
done lightly. If some RBLs are found to be responding slowly, this
should be investigated. A single non-responding RBL can have a
significant effect, as it can extend the DNS wait time up to the
rbl_timeout value, which by default is 15 seconds, which I find
way beyond reasonable.

See the man page for Mail::SpamAssassin::Conf. The rbl_timeout setting
can provide a global default, but can also be specified individually
per zone. Here is an example (not necessarily a recommendation):

  rbl_timeout 5.5 2.2
  rbl_timeout 7.7 3.3 open-whois.org
  rbl_timeout 4 surbl.org
  rbl_timeout 3.5 dwl.spamhaus.org

Also make sure the DNS setup is fine: use a local caching DNS resolver
(on the same host or on the same LAN), avoid a remote or heavily
utilized DNS server, and dedicate a DNS resolver to mail checking.

> 3) dcc, pyzor and razor2 and freewebmail were disabled

The DCC is not supposed to be slow. Use a local dccifd daemon,
and set up a local server if necessary.

> 4) sa-compile was utilized to improve the systems being cpu bound

Yes, that can help by a few percent. Similarly, re-compiling perl
with a more recent version of gcc can bring another few percent of
speedup. And using amavisd 2.8.0 instead of an earlier version
can bring another few percent, as it avoids some I/O for smaller
messages (which amount to about 90 % of all messages according to
our statistics).

> We have tested by turning the following flags on (to disable the services):
>
> $skip_rbl_checks= 1 (spamassassin)
> $sa_local_tests_only = 1 (amavis hook)

This will drastically reduce the quality of SpamAssassin classification,
and should not be done lightly, except for short periods to avoid
a crisis.

> With the above we have started to see an improvement but we are still
> trying to identify the root cause.

Observe the timing statistics, and check the timing for a couple of
sample messages by running 'spamassassin -D -t' from a command line.

Amavisd provides timing reports (the ' TIMING ' log entries) at log
level 2 or above. Amavisd will also include SpamAssassin's detailed
timing report (the ' TIMING-SA ' log entries) in its log at log level 2
or above.

The 'amavisd-nanny' (or its newer sister 'amavisd-status', which uses
the quicker 0MQ (ZeroMQ) message passing library instead of a Berkeley DB)
can also provide a valuable first assessment of the status of all
amavisd processes.

Verify that most of the time is indeed spent in SpamAssassin, and if so,
check its timing report to narrow down the search for a culprit.

If RBL/DNS lookups in SpamAssassin are suspect, you can enable the
'async' and 'timing' debug areas in SpamAssassin by using the option -d
in amavisd (similar is possible in spamd), e.g.:

  # amavisd -d async,timing

which will enable SpamAssassin debug messages for the listed areas in
the amavisd log at log level 3 or above. Searching for
'SA dbg: async: timing:' in the log will yield a detailed timing report
for each RBL/DNS lookup as issued by SpamAssassin.

Sometimes certain SpamAssassin rules can be much slower than the rest.
A less-than-carefully written regex in some local rule can take a heavy
toll. If you have some sample messages which seem to be taking above
average processing time, one way to assess how much time each rule
takes is to enable the plugin HitFreqsRuleTiming.pm, which can be found
in the SpamAssassin distribution in the directory ./masses/plugins/ .
When this plugin is enabled, at the end of each mail check it will
produce a file 'timing.log' in the current directory (which must be
writable). The file contains a sorted top-n list of the slowest rules.
This plugin is not suitable for production use, but can be very
useful when the 'spamassassin' command is run on a mail sample.

Last but not least, do not forget the basic amavisd / MTA tuning knob:
the number of child processes. If CPU utilization is close to 100 %
for extended periods, reducing the number of processes can save memory
and may actually increase throughput. On the other hand, if mail
processing is falling behind while CPU utilization is not very high and
memory is available, increasing the number of processes will help
considerably.

> Our current fix is really a patch since those custom changes will be
> removed during the next nightly sa-update run. Are there recommend ways
> to create an exclude list to allow these types of local updates to be
> included in the sa-update workflow without having to hack it directly?

Rules and settings in files like local.cf will override the default
settings and rules that come with sa-update.

> Has anyone experienced anything similar since September 2nd update?

Not really, but perhaps I wasn't paying attention.

  Mark