On Wed, 08 May 2013 19:32:26 +0200
Axb <axb.li...@gmail.com> wrote:

> - your HAM is somebody else's SPAM

Do you have evidence for that?

The reason I ask is that one of the main features of our (commercial) anti-spam solution is a very large Bayes database. Once a night, we aggregate all the tokens from votes from all of our customers and push out a Bayes database containing tokens for the last 21 days from about 3.2 million spam and 3.4 million ham messages.

It works really well and we find that even our highly diverse customer database agrees substantially on spam vs. ham.

There was a USENIX paper on this topic quite a while ago:

http://static.usenix.org/event/lisa04/tech/blosser/blosser_html/

It won the best paper award for LISA '04.

> - A decent Bayes DB is highly dynamic and yesterday's tokens from
>   someone else's traffic will be useless to you traffic, today.

Not true. Bayes data remains relevant for several days, if not weeks or months. Obviously, our system *also* includes individual Bayes databases that adapt to specific users' mail flows and update more than once a day, but even the daily-updated central database is surprisingly good. (It seems that a large sample size is the key.)

Karsten Bräckelmann wrote:

> Just try to imagine working in an industry where e.g. Viagra and
> Cialis are totally legit phrases to use...

Actually, we find that is not a problem because spammers use things like Vi@gr@ and C1AL1S that are far more damning than the unmodified words themselves. Also, our Bayes implementation uses word pairs as well as individual words, which improves its selectivity.

Anyway, my main point is this: Don't dismiss a shared Bayes database without supplying evidence that it's a bad idea. :)

Regards,

David.
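
For readers who want something concrete, here is a rough sketch of the two ideas described above: tokens made of single words plus adjacent word pairs, and a nightly merge of everyone's votes restricted to the last 21 days. It is not the commercial implementation. The function names, the whitespace tokenizer, the once-per-message counting and the (date, label, text) vote format are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date, timedelta


def tokenize(text):
    """Return single words plus adjacent word pairs (assumed scheme)."""
    words = text.lower().split()
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + pairs


def aggregate(votes, today, window_days=21):
    """Merge votes from all customers into shared spam/ham token counts.

    votes: iterable of (vote_date, label, text), label is "spam" or "ham".
    Votes older than window_days are ignored (hypothetical expiry rule).
    """
    cutoff = today - timedelta(days=window_days)
    counts = {"spam": Counter(), "ham": Counter()}
    for vote_date, label, text in votes:
        if vote_date >= cutoff:
            # Count each distinct token once per message.
            counts[label].update(set(tokenize(text)))
    return counts


if __name__ == "__main__":
    votes = [
        (date(2013, 5, 7), "spam", "Buy Vi@gr@ now"),
        (date(2013, 5, 6), "ham", "meeting notes attached"),
        (date(2013, 4, 1), "spam", "old spam"),  # outside the window
    ]
    shared = aggregate(votes, today=date(2013, 5, 8))
    print(shared["spam"].most_common(3))
```

With word pairs, an obfuscated token such as "vi@gr@" and the pair "buy vi@gr@" are counted separately, so a site where the plain word is legitimate can still be told apart from spam that uses the mangled form.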