RW wrote:
On Wed, 5 Jun 2019 10:45:13 -0400
Kris Deugau wrote:

jim.ander...@wohosting.net wrote:
Greetings,

I've searched but haven't had any luck finding documentation about
how to determine the optimal settings for spamd children
(max-children, min-children, max-spare, min-spare, and
max-conn-per-child). I have a dedicated server for running spamd.
It has 6GB (can add more) and 6 cores. What would be the best
settings? Or how would I determine the best settings?

"Try it and see."  :/

At a minimum you'll want to make sure that you don't spawn more spamd
children than you can keep in RAM;  watch your system for a while,
take the worst-case spamd memory footprint, and divide that into your
physical RAM to find the absolute largest max-children you should
use.
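As a rough sketch of that arithmetic (the per-child footprint and OS headroom below are hypothetical examples, not measurements from anyone's system):

```python
# Rough capacity arithmetic for max-children, per the advice above.
# Inputs are hypothetical examples; substitute your own measurements.
total_ram_mb = 6 * 1024          # 6 GB box from the original question
worst_case_child_mb = 150        # worst-case spamd child footprint observed
os_and_other_mb = 1024           # headroom for the OS and other daemons

max_children = (total_ram_mb - os_and_other_mb) // worst_case_child_mb
print(max_children)  # 34
```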

You can get a more accurate feel for that limit by stress-testing and
watching for significant swap I/O, but if swapping turns out to be a
factor, more memory may be needed.

What I did was measure the CPU limited throughput without network
tests, and then calculate the number of children needed to sustain that
throughput with a scan time on the high end of those seen.
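One way to read that calculation (all numbers below are hypothetical placeholders) is as Little's law: the concurrency needed is the sustained message rate times the high-end scan time.

```python
# Children needed to sustain a target throughput at a given scan time
# (Little's law: concurrency = arrival rate * service time).
# Both inputs are hypothetical, not measurements from the thread.
import math

msgs_per_second = 5.0        # CPU-limited throughput you measured
high_end_scan_seconds = 4.0  # scan time at the high end of those seen

children_needed = math.ceil(msgs_per_second * high_end_scan_seconds)
print(children_needed)  # 20
```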

It's a good idea to check that you can actually reach full CPU usage
and aren't running into an avoidable locking bottleneck with Bayes etc.

*nod* If you're having to fine-tune any of these to keep the system fully busy without overloading it, CPU is another key factor to watch; you can allow more processes than CPU cores, but unless your DNS resolver is slow, not by much. Hyperthreading may give you a bit more slack, but I don't think a HT "core" really gives you a full CPU core of benefit for a workload like SA. On top of that, there's the growing list of security issues that come with just having it enabled. :/


We've also found that it's best to set max-children to
min-children+1, and max/min-spare to 1.  It may have been improved
since we last reviewed our settings in detail, but at the time spamd
didn't spawn new children fast enough under load spikes.
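Concretely, that recommendation amounts to an invocation along these lines (the child counts and connection limit here are made-up placeholders, not tuned values):

```shell
# Prespawn nearly the full complement of children so load spikes
# don't wait on forking. Numbers are placeholders, not recommendations.
spamd --min-children=23 --max-children=24 \
      --min-spare=1 --max-spare=1 \
      --max-conn-per-child=200
```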

Despite what it says in the documentation there isn't an actual rate
limit. What happens is that above 'min-spare' processes the number of
children only gets incremented when a child becomes idle after
completing a scan or initializing. So the worst case is a delay of the
time it takes to scan one message. Once a scan completes, the number
of children can jump to 'max-spare' instantaneously.

My memory is a little hazy on the specifics; it was ~8+ years ago IIRC when I was seeing problems and experimented with settings to avoid them. I don't recall any documentation regarding a *defined* limit on the child spawn rate, but in live testing at the time there certainly was one.

We were seeing on the order of 30-60s to scale up; think "single message with huge CC list", where SA is called on final delivery for each recipient. I don't recall offhand if it was "spawn new child process, wait, spawn, wait, etc" or if it was a burst as you say above, but the ultimate result was a lot of mail suddenly stuck in the inbound mail queue waiting for delivery. Prespawning the maximum number of child processes "fixed" the problem. In the worst cases IIRC it took up to about 30 minutes to clear the backlog.

It might not be required any more, but I haven't seen any issues from continuing to prespawn the full set.

We were also having issues at the time with pathological spam using gigantic (>200K) HTML comments causing severe slowdowns, so scan time on any given spam sometimes averaged 15-20s, if it returned at all. We added a second spamd instance relying almost solely on a subset of DNS rules plus a handful of local rules that we tuned to skim these off first.
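A second lightweight instance along those lines might be started as something like the following; the port and config directory are hypothetical, and the trimmed ruleset itself would live under the directory passed to --siteconfigpath:

```shell
# Lightweight pre-filter spamd on a separate port, loading only a
# trimmed config directory (DNS rules plus a few local rules).
# Port and path are hypothetical examples.
spamd --port=784 \
      --siteconfigpath=/etc/mail/spamassassin-light \
      --max-children=10
```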

The worst case happens when the system is completely idle when the
spike arrives, so it's something that's more likely to be seen in
testing than on a busy system.

Our two scanning nodes are nearly idle most of the time (usually up to about 5 active children of 70 in the main SA instance) but if a big burst of mail comes in, it can hit the current limits we have set. I'd raise them but I think at this point we'd hit a CPU limit instead; we do not have 70 CPU cores available on these systems, and they're also running ClamAV and a separate spamd instance for outbound mail scanning.

We've also scaled to allow for load balancing and taking a node down without impacting operations; while we're heavily overprovisioned for the average load, we're still a bit tight at peak load.

Setting them equal caused a deadlock of some kind IIRC.

Is there a bug report for that?

It was quite a while ago (possibly as far back as SA 3.2 - wasn't there a new forking pattern introduced around then?). I'll see if I can reproduce it with current release or trunk versions.

-kgd
