On 4/29/2010 8:25 AM, Frank Bures wrote:
> I've been running spamassassin for years.  I am using auto-learn with very
> conservative thresholds.  However, after several years of usage my spam
> database is about three time larger than my ham database and I am starting
> to see false positives.
>
> Is there a way how to "shrink" the spam database?
>
> Thanks
> Frank
>
>   
First, I assume you are referring to the "nspam" and "nham" counts
reported by sa-learn --dump magic. Those numbers are a bit misleading,
mostly because the DB contains tokens, not messages.

Those counts reflect the total number of messages ever learned, not the
current contents of the database. They never decrease: even if every
token learned from a message gets expired out of the database, the
message is still counted. (OK, technically they do decrease if you clear
your DB.. but...).
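If you want to pull those counters out programmatically, you can parse the
dump-magic output. This is just a sketch; the sample text and the exact
column layout below are assumptions based on a typical SpamAssassin 3.x
database dump, and the numbers are invented for illustration:

```python
import re

# Sample (invented) output from `sa-learn --dump magic`; column layout
# is an assumption based on a typical SpamAssassin 3.x install.
DUMP = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0      45231          0  non-token data: nspam
0.000          0      15044          0  non-token data: nham
0.000          0     812345          0  non-token data: ntokens
"""

def magic_counts(dump):
    """Pull the nspam/nham message counters out of dump-magic output."""
    counts = {}
    for line in dump.splitlines():
        m = re.match(
            r"\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data: (nspam|nham)",
            line,
        )
        if m:
            counts[m.group(2)] = int(m.group(1))
    return counts

counts = magic_counts(DUMP)
print(counts)                             # the two message counters
print(counts["nspam"] / counts["nham"])   # the spam:ham learning ratio
```

With the made-up numbers above, that ratio works out to roughly 3:1, which
is the sort of imbalance being discussed here.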

A lot of the reason for this is that SpamAssassin breaks each learned
message into dozens of "tokens", each of which has its own life in the
database after learning. Which message(s) a token was learned from is
not tracked, because it is not relevant to anything SpamAssassin does
and would take up extra space in the DB. SpamAssassin tracks how often
each token crops up in your mailstream, and uses that as the basis for
deciding what's worth keeping. Over time, some, half, most, or all of
the tokens learned from a particular message may get expired due to
lack of use. But SA has no way of tracing that back to the original
message(s) and saying "OK, this token was learned from spam message
123456 from October 2009 and message 23456 from December 2009". Even if
it could, what should it do? Count partial messages? Only reduce the
count when all tokens from message 123456 have expired?
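To make that concrete, here is a toy sketch (emphatically not
SpamAssassin's real code; the class and method names are invented) of
why the message counter only ever grows while token expiry shrinks the
token store underneath it:

```python
# Toy model: learning bumps nspam once per message, while expiry works
# purely on per-token access times and never touches nspam.

class ToyBayesDB:
    def __init__(self):
        self.nspam = 0       # total spam messages ever learned
        self.tokens = {}     # token -> (spam_hits, last_seen)

    def learn_spam(self, message, now):
        self.nspam += 1      # message counter: only ever increments
        for tok in set(message.lower().split()):
            hits, _ = self.tokens.get(tok, (0, now))
            self.tokens[tok] = (hits + 1, now)

    def expire(self, max_age, now):
        """Drop tokens not seen recently; nspam is left untouched."""
        self.tokens = {
            t: (h, seen)
            for t, (h, seen) in self.tokens.items()
            if now - seen <= max_age
        }

db = ToyBayesDB()
db.learn_spam("cheap pills buy now", now=1000)
db.learn_spam("buy cheap watches", now=2000)
print(db.nspam, len(db.tokens))   # 2 messages, 5 distinct tokens
db.expire(max_age=500, now=2100)  # tokens only seen at t=1000 age out
print(db.nspam, len(db.tokens))   # nspam is still 2; 3 tokens remain
```

Note that "pills" and "now" expire because they were last seen in the
first message only, yet nspam stays at 2 - there is no per-message
bookkeeping to decrement.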

As far as input training goes, technically 1:1 is a "perfect" ratio, but
few ever achieve it. Personally I think 3:1 is pretty good for the real
world, and I've seen plenty of DBs as unbalanced as 20:1 still work
quite well (let's face it, in the real world there's a LOT more spam
than nonspam).

Are you having accuracy problems? Or are you just getting concerned
about the numbers not "looking right"?

With Bayes, I'd be more concerned about the results than about whether
the numbers "look right". Reality rarely "looks right" when you try to
express it numerically, because humans behave strangely, not in nice,
well-behaved patterns.
