On Thursday, I bit the bullet and upgraded SpamAssassin from 2.63 to 3.0.0. The new release had been out a month, and 2.63 just wasn't doing a good job of tagging spam. My server runs RedHat 8.0 on a PIII-1GHz with 512MB of RAM and SCSI RAID-0. I'm using "lock_method flock" in my local.cf file. I'm running qmail with vpopmail, and have a single bayes/awl shared by all users. This setup has worked great for a long time, processing about 3000 emails a day. I use qmail-spamc in concert with qscanq to scan messages as they come into qmail-smtpd.
After the upgrade, though, I started having problems with spamd processes chewing through huge amounts of memory. A process gets stuck on a message and then slowly builds up to hundreds of megabytes in use (which is significant on a machine with 512MB). The system starts swapping, the performance of other processes goes down, and then I'm in trouble. I tried upgrading to 3.0.1 (Murphy's law: as soon as you upgrade from an old version, another release will come out within a day).
For example, with 4 children (it also happened consistently with 2 children), the following series of events happened this morning:
2:48 - each child is respawned cleanly (after hitting 200 messages)
3:13 - one process gets hung up; the other processes go from <5 second processing times to 30+ seconds
3:15 - another process gets hung up
3:17 - another process gets hung up (now 3 of 4). The remaining process dutifully chugs along.
3:22 - the last process gets hung up
3:41 - the main spamd gets SIGCHLD for the first hung process (out of swap?) but can't spawn new processes (it gets a SIGCHLD whenever a new process tries to scan a message).
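For anyone who wants to catch a runaway child before the box starts swapping, here is a rough sketch of the kind of check that could run from cron. It is only an illustration, not something I have in production: the function names (parse_ps, find_hogs) and the 100MB threshold are made up for this example, and it assumes a modern Python 3 (3.7 or later) and a Linux ps that accepts "-eo pid,rss,args" and reports RSS in kilobytes.

```python
import subprocess


def parse_ps(output, name="spamd", limit_kb=100 * 1024):
    """Return (pid, rss_kb) for each process whose command line contains
    `name` and whose resident memory is above `limit_kb`.

    `output` is the text produced by `ps -eo pid,rss,args`.
    """
    hogs = []
    for line in output.splitlines():
        fields = line.split(None, 2)  # pid, rss, then the full command line
        if len(fields) < 3:
            continue
        pid, rss, args = fields
        # The header row ("PID RSS COMMAND") fails the isdigit() checks.
        if not (pid.isdigit() and rss.isdigit()):
            continue
        if name in args and int(rss) > limit_kb:
            hogs.append((int(pid), int(rss)))
    return hogs


def find_hogs(name="spamd", limit_kb=100 * 1024):
    """Run ps once and report the processes that exceed the threshold."""
    result = subprocess.run(
        ["ps", "-eo", "pid,rss,args"],
        capture_output=True, text=True, check=True,
    )
    return parse_ps(result.stdout, name, limit_kb)


if __name__ == "__main__":
    for pid, rss_kb in find_hogs():
        print(f"spamd pid {pid} is using {rss_kb // 1024} MB resident")
```

Logging the output with a timestamp every minute or so would give a timeline like the one above, with memory figures next to each hang.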
I have dutifully run 'sa-learn --force-expire'. I wonder about my auto-whitelist though -- it's up to 315MB. Is there anything I can run to prune it down? Even my bayes_seen is only 41MB.
I checked to make sure that it wasn't a malformed message causing trouble -- I was manually scanning some emails (ones that weren't scanned because spamd had freaked out) and got it to hang on one of the files. But when I restarted spamd and rescanned the same file with spamc, it was scored just fine.
I'm currently trying to run with a single child (to see if it's a file-locking issue), and if that doesn't work, I guess I'll try ditching the entire whitelist to see if that helps.
Has anyone else seen this behavior? Any recommendations? Additional information I can provide that will help narrow down the problem?
--
Tom Collins - [EMAIL PROTECTED]
QmailAdmin: http://qmailadmin.sf.net/ Vpopmail: http://vpopmail.sf.net/
Info on the Sniffter hand-held Network Tester: http://sniffter.com/