On Jan 26, 2007, at 6:09 AM, Jack Gostl wrote:
The amount of spam getting through my filters has been steadily
increasing, from a start of under two percent up to over ten
percent. It was getting pretty bad, so finally, just on a hunch,
I wiped my Bayes files and rebuilt them. And, voila! I'm now
running under one percent.
Has anyone else seen this? Are there any suggestions as to how to
deal with this? Should I regularly rebuild the Bayes files?
Appreciate any advice.
Jack
I will attempt to answer your question as someone who has almost
zero experience with SpamAssassin but years of experience with
Bayesian filters.
You should not have to rebuild your files regularly. Needing to do
so runs contrary to the whole notion of statistical filtering and,
if it really is necessary, may indicate that SA has a problem with
its approach.
However, you may have an issue with how you conduct your training.
This is the area where I might have some useful information to share...
As I understand it, which means I could be wrong, SpamAssassin will
learn from past emails only when they score as clearly good or
clearly bad, and will ignore the grey areas in between.
This means that your Bayesian filter is only going to pick up
intelligence on the obvious spam and will ignore all messages in the
grey area (just barely spam).
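To make that concrete, here is a rough Python sketch of threshold-gated
auto-learning as I understand it. It is not SA's code: the function
name and both threshold values are made up for illustration, and the
real auto-learn logic may apply additional conditions.

```python
def autolearn_decision(score, ham_below=0.1, spam_above=12.0):
    """Return 'ham', 'spam', or None for a message's overall score.

    ham_below and spam_above are illustrative assumptions, not
    values taken from any particular SpamAssassin install.
    """
    if score < ham_below:
        return "ham"    # confidently good: learn as ham
    if score > spam_above:
        return "spam"   # confidently bad: learn as spam
    return None         # grey area: the filter learns nothing


# Hypothetical message scores, from clearly ham to clearly spam.
scores = [-1.2, 0.05, 3.0, 5.5, 7.9, 14.0]
decisions = [autolearn_decision(s) for s in scores]
learned = [(s, d) for s, d in zip(scores, decisions) if d is not None]
skipped = [s for s, d in zip(scores, decisions) if d is None]
```

With these made-up thresholds, every score in the middle of the range
ends up in `skipped`, which is exactly the borderline mail the filter
most needs to learn from.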
Options are:

1. Greater reinforcement of learning what is spam/ham through user
   feedback. Take everything that is in this grey area and instruct
   SA on its good/bad status. This is really a refinement of the
   current autolearn, or "train on everything". But again, I could
   be wrong about how SA deals with Bayesian learning.

2. "Train on error". Once you have a sufficient database of tokens,
   train only on messages that you specifically identify as errors,
   disabling the auto-learn aspect of SA. This keeps the database
   small and prevents (or minimizes) database poisoning.

3. "Train to exhaustion", which means that once you tag an email as
   incorrectly classified, you keep training the database until it
   scores that email correctly. You might have to refeed an email
   into the database many times. But I don't think SA will even
   allow you to do this. Other Bayesian classifiers will.
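
For what it's worth, the "train on error" and "train to exhaustion"
ideas above can be sketched with a toy classifier. This is a minimal
illustration only: TinyBayes, train_on_error, and train_to_exhaustion
are hypothetical names of my own, the model is a bare-bones naive
Bayes over whitespace tokens, and none of it reflects how SA or any
other real filter is implemented.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy two-class naive Bayes over whitespace tokens (illustration only)."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Count each distinct token once per message.
        self.messages[label] += 1
        self.tokens[label].update(set(text.lower().split()))

    def _log_score(self, words, label):
        # Laplace-smoothed log prior plus per-token log likelihoods.
        total = sum(self.messages.values())
        score = math.log((self.messages[label] + 1) / (total + 2))
        for w in words:
            seen = self.tokens[label][w]
            score += math.log((seen + 1) / (self.messages[label] + 2))
        return score

    def classify(self, text):
        words = set(text.lower().split())
        spam = self._log_score(words, "spam")
        ham = self._log_score(words, "ham")
        return "spam" if spam > ham else "ham"


def train_on_error(clf, text, label):
    """Train only when the current verdict is wrong; return True if trained."""
    if clf.classify(text) == label:
        return False
    clf.train(text, label)
    return True


def train_to_exhaustion(clf, text, label, max_rounds=20):
    """Refeed a misclassified message until it is classified correctly.

    Returns the number of training passes used (capped at max_rounds).
    """
    rounds = 0
    while clf.classify(text) != label and rounds < max_rounds:
        clf.train(text, label)
        rounds += 1
    return rounds


# Small made-up corpus: mostly ham, one obvious spam.
clf = TinyBayes()
for _ in range(5):
    clf.train("meeting agenda for monday", "ham")
clf.train("cheap pills online", "spam")

grey = "monday meeting agenda pills"      # borderline message
before = clf.classify(grey)               # verdict before correction
rounds = train_to_exhaustion(clf, grey, "spam")
after = clf.classify(grey)                # verdict after correction
```

The max_rounds cap is my own safety valve so the loop cannot run
forever. Note that train_on_error leaves the database untouched
whenever the classifier already gets a message right, which is what
keeps the token database small.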
My personal experience with other Bayesian classifiers has been that
"train on error" is extremely effective over a long period of time,
with minimal impact on the database and performance of the applications.