On 4/29/2010 8:25 AM, Frank Bures wrote:
> I've been running spamassassin for years. I am using auto-learn with very
> conservative thresholds. However, after several years of usage my spam
> database is about three times larger than my ham database and I am starting
> to see false positives.
>
> Is there a way to "shrink" the spam database?
>
> Thanks
> Frank

First, I assume you are referring to the "nspam" and "nham" counts reported by sa-learn --dump magic, which are a bit misleading, mostly because the DB contains tokens, not messages.
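For reference, the counts in question show up in the dump like this (the numbers below are made up for illustration, not from any real DB):

    $ sa-learn --dump magic
    0.000          0          3          0  non-token data: bayes db version
    0.000          0       9421          0  non-token data: nspam
    0.000          0       3140          0  non-token data: nham
    0.000          0     158377          0  non-token data: ntokens
    0.000          0 1266440325          0  non-token data: oldest atime
    0.000          0 1272551102          0  non-token data: newest atime

Note the difference between nspam/nham (messages ever trained) and ntokens (what's actually in the DB right now); the two move independently.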
Those are a reflection of the total counts of messages ever learned, not the current contents of the database. They never decrease; even if every token learned from a message has expired out of the database, the message is still counted there. (OK, technically they do decrease if you clear your DB, but...)

A lot of the reason for this is that SpamAssassin breaks each learned message into dozens of "tokens", each of which has its own life in the database after learning. Which message(s) a token was learned from is not tracked, because it is not relevant to anything SpamAssassin does and would take up extra space in the DB. SpamAssassin tracks how often each token crops up in your mailstream, and uses that as the basis for what's worth keeping.

Over time some, half, most, or all of the tokens learned from a particular message may expire due to lack of use. But SA has no way of tracking that back to the original message(s) and saying "OK, I've now expired this token, which was learned from spam message 123456 from October 2009 and from message 23456 from Dec 2009." Even if it could, what should it do? Count partial messages? Only reduce the count when all tokens from message 123456 have expired?

As far as input training goes, technically 1:1 is a "perfect" ratio, but few ever achieve that. Personally I think 3:1 is pretty good for the real world, and I've seen plenty of DBs as unbalanced as 20:1 still work quite well (let's face it, in the real world there's a LOT more spam than nonspam).

Are you having accuracy problems, or are you just getting concerned about the numbers not "looking right"? With bayes, I'd be more concerned about the results than about "looking right". Reality rarely "looks right" when you try to express it numerically, because humans behave strangely, not in nice, well-behaved patterns that decompose neatly.
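That said, if you do decide the token DB itself needs trimming, the usual knobs are token expiry and, as a last resort, a wipe-and-retrain. A rough sketch (the config values shown are just the shipped defaults, and the backup filename is made up; adjust for your site):

    # force an expiry run on the bayes token DB right now
    sa-learn --force-expire

    # or let auto-expiry handle it, controlled in local.cf:
    #   bayes_auto_expire 1
    #   bayes_expiry_max_db_size 150000

    # last resort: back up, wipe, and retrain from known-good corpora
    sa-learn --backup > bayes-backup.txt
    sa-learn --clear

Keep in mind that --force-expire only does real work if the DB is over bayes_expiry_max_db_size; below that it's mostly a no-op.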