Re: Default Bayes Database

David F. Skoll Fri, 10 May 2013 12:51:51 -0700

On Wed, 08 May 2013 19:32:26 +0200
Axb <axb.li...@gmail.com> wrote:

> - your HAM is somebody else's SPAM

Do you have evidence for that?  The reason I ask is that one of the
main features of our (commercial) anti-spam solution is a very large
Bayes database.  Once a night, we aggregate all the tokens from votes from
all of our customers and push out a Bayes database containing tokens for the
last 21 days from about 3.2 million spam and 3.4 million ham messages.
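A minimal sketch of that kind of nightly aggregation, assuming per-day token counts are already available as Python Counters. The names `aggregate`, `daily_counts` and `WINDOW_DAYS` are hypothetical and only illustrate the 21-day window; this is not the actual implementation of the product described above.

```python
from collections import Counter
from datetime import date, timedelta

WINDOW_DAYS = 21  # retention window taken from the description above


def aggregate(daily_counts, today, window_days=WINDOW_DAYS):
    """Merge per-day token counts, skipping days outside the window.

    daily_counts maps a date to a (spam_counter, ham_counter) pair
    (an assumed layout, chosen only for this illustration).
    """
    cutoff = today - timedelta(days=window_days)
    spam, ham = Counter(), Counter()
    for day, (day_spam, day_ham) in daily_counts.items():
        if day > cutoff:  # keeps today and the 20 days before it
            spam.update(day_spam)
            ham.update(day_ham)
    return spam, ham
```

Calling `aggregate(daily_counts, date.today())` once a night would yield the merged spam and ham counts to push out; anything older than the window simply ages out on the next run.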

It works really well and we find that even our highly diverse customer
database agrees substantially on spam vs. ham.

There was a USENIX paper on this topic quite a while ago:
http://static.usenix.org/event/lisa04/tech/blosser/blosser_html/
It won the best paper award for LISA '04.

> - A decent Bayes DB is highly dynamic and yesterday's tokens from
> someone else's traffic will be useless to your traffic, today.

Not true.  Bayes data remains relevant for several days, if not weeks or
months.

Obviously, our system *also* includes individual Bayes databases that adapt
to specific users' mail flows and update more than once a day, but even the
daily-updated central database is surprisingly good.  (It seems that a large
sample size is the key.)

Karsten Bräckelmann wrote:

> Just try to imagine working in an industry where e.g. Viagra and
> Cialis are totally legit phrases to use...

Actually, we find that is not a problem because spammers use things
like Vi@gr@ and C1AL1S, which are far more damning than the unmodified words
themselves.  Also, our Bayes implementation uses word pairs as well as
individual words, which improves its selectivity.
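To illustrate the word-pair idea, here is a minimal sketch of a tokenizer that emits both single words and adjacent word pairs. It assumes simple whitespace splitting and lowercasing; the function name `tokens` is hypothetical, and the real implementation may tokenize quite differently.

```python
def tokens(text):
    """Return single-word tokens followed by adjacent word-pair tokens."""
    words = text.lower().split()
    # Pair each word with its successor, e.g. "buy vi@gr@".
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + pairs
```

With pairs in the database, an obfuscated spelling such as "vi@gr@" counts both on its own and together with its neighbours, so the context a word appears in contributes evidence as well.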

Anyway, my main point is this: Don't dismiss a shared Bayes database
without supplying evidence that it's a bad idea. :)

Regards,

David.
