On Jan 26, 2007, at 6:09 AM, Jack Gostl wrote:

The amount of spam getting through my filters has been steadily increasing, from a start of under two percent up to over ten percent. It was getting pretty bad, so finally, just on a hunch, I wiped my Bayes files and rebuilt them. And, voila! I'm now running under one percent.

Has anyone else seen this? Are there any suggestions as to how to deal with this? Should I regularly rebuild the Bayes files?

Appreciate any advice.

Jack



I will attempt to answer your question as someone who has almost zero experience with SpamAssassin but years of experience with Bayesian filters.

You should not have to rebuild your files regularly. Needing to do so runs contrary to the whole notion of statistical filtering and, if true, may indicate that SA has a problem with its approach.
However, you may have an issue with how you conduct your training.
This is the area where I might have some useful information to share...

As I understand it, which means I could be wrong, SpamAssassin auto-learns from a past email only when its score is clearly good or clearly bad, and ignores the grey area in between. This means that your Bayesian filter only picks up intelligence from the obvious spam and ham, and learns nothing from messages in the grey area (just barely spam).
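To make the grey-area point concrete, here is a minimal sketch of threshold-gated auto-learning. It is not SpamAssassin's code: the function name and the two threshold values are assumptions chosen for illustration, and a real installation may be configured differently.

```python
# Hypothetical thresholds for illustration only; not read from any real config.
HAM_BELOW = 0.1          # scores under this are learned as ham
SPAM_AT_OR_ABOVE = 12.0  # scores at or over this are learned as spam


def autolearn_decision(score, ham_below=HAM_BELOW, spam_at_or_above=SPAM_AT_OR_ABOVE):
    """Return 'ham', 'spam', or None (not learned) for a message's rule score."""
    if score < ham_below:
        return "ham"
    if score >= spam_at_or_above:
        return "spam"
    return None  # grey area: the Bayes database never sees this message


scores = [-1.2, 0.05, 3.0, 5.1, 7.9, 12.0, 25.4]
skipped = [s for s in scores if autolearn_decision(s) is None]
print(skipped)
```

Under these assumed thresholds, everything scoring between the two cut-offs is skipped, which is exactly the "just barely spam" mail that tends to slip through.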

Options are:
"Train on everything": greater reinforcement of what is spam/ham through user feedback. Take everything that falls in this grey area and tell SA whether it is good or bad. This is really a refinement of the current autolearn. But again, I could be wrong about how SA deals with Bayesian learning.

"Train on error": once you have a sufficient database of tokens, train only on the messages that you specifically identify as errors, and disable the auto-learn feature of SA. This keeps the database small and prevents (or minimizes) database poisoning.

"Train to exhaustion": once you flag an email as incorrectly classified, you keep training on it until the database scores it correctly. You might have to refeed an email into the database many times. I don't think SA will even allow you to do this, but other Bayesian classifiers will.
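The last two strategies can be sketched with a toy classifier. This is only an illustration under stated assumptions: `ToyBayes`, `train_on_error`, and `train_to_exhaustion` are hypothetical names, the scoring is a simple smoothed log-odds sum rather than SpamAssassin's Bayes implementation, and `max_rounds` is an arbitrary safety cap.

```python
import math
from collections import Counter


class ToyBayes:
    """Tiny token-count classifier; a stand-in, not SpamAssassin's Bayes code."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.messages[label] += 1
        self.tokens[label].update(set(text.lower().split()))

    def spam_log_odds(self, text):
        # Sum of per-token log-likelihood ratios with add-one smoothing.
        total = 0.0
        for tok in set(text.lower().split()):
            p_spam = (self.tokens["spam"][tok] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.tokens["ham"][tok] + 1) / (self.messages["ham"] + 2)
            total += math.log(p_spam / p_ham)
        return total

    def classify(self, text):
        return "spam" if self.spam_log_odds(text) > 0 else "ham"


def train_on_error(clf, text, true_label):
    """Train only when the current verdict is wrong. Returns True if trained."""
    if clf.classify(text) == true_label:
        return False
    clf.train(text, true_label)
    return True


def train_to_exhaustion(clf, text, true_label, max_rounds=50):
    """Refeed a misclassified message until it is scored correctly."""
    rounds = 0
    while clf.classify(text) != true_label and rounds < max_rounds:
        clf.train(text, true_label)
        rounds += 1
    return rounds


clf = ToyBayes()
clf.train("cheap pills online", "spam")
clf.train("meeting notes attached", "ham")
clf.train("lunch meeting tomorrow", "ham")

grey = "meeting pills tomorrow"          # a spam that looks mostly like ham
print(clf.classify(grey))                # verdict before correction
print(train_to_exhaustion(clf, grey, "spam"), clf.classify(grey))
```

Note that `train_on_error` leaves the database untouched for mail that is already classified correctly, which is why this strategy keeps the token database small.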

My personal experience with other Bayesian classifiers is that "train on error" is extremely effective over a long period of time, with minimal impact on the database and the performance of the application.
