David F. Skoll wrote:
> Axb wrote:
> > - your HAM is somebody else's SPAM
>
> Do you have evidence for that?  The reason I ask is that one of the
> main features of our (commercial) anti-spam solution is a very large
> Bayes database.  Once a night, we aggregate all the tokens from votes from
> all of our customers and push out a Bayes database containing tokens for the
> last 21 days from about 3.2 million spam and 3.4 million ham messages.
>
> It works really well and we find that even our highly diverse customer
> database agrees substantially on spam vs. ham.

The weasel words "agrees substantially" are telling. If it isn't 100% with no false positives, then at least one of those messages does not agree. That would be the evidence requested.

I am not saying that your technique isn't useful. It is very pragmatic. I am sure it is very effective. I would probably do that myself. But it isn't 100%.

And would you suggest distributing your well-averaged database to people who install SpamAssassin so as to seed their Bayes? How would it be distributed for users to use when installing SpamAssassin? And if you did, how would this large corpus of learned tokens affect the much smaller number of messages the user trains with when they train-on-error? Would the much larger numbers swamp it? It is trouble.

I think having users start with a blank slate and then learn from their own messages makes the most sense. And users can always learn from their current mailbox of past messages, so it isn't much hardship.

Bob
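
To put rough numbers on the swamping question above, here is a minimal sketch. It uses a simplified relative-frequency estimate for a single token, which is an assumption for illustration and not SpamAssassin's actual Bayes computation. The corpus sizes are the ones quoted above; the per-token counts, the size of the user's own database, and the function names are all made up.

```python
def spam_prob(spam_hits, ham_hits, n_spam, n_ham):
    """Simplified per-token spam probability from relative frequencies.

    Illustrative only; this is not SpamAssassin's formula.
    """
    spam_freq = spam_hits / n_spam
    ham_freq = ham_hits / n_ham
    return spam_freq / (spam_freq + ham_freq)


# Corpus sizes taken from the quoted message.
SEED_SPAM, SEED_HAM = 3_200_000, 3_400_000
# Hypothetical token that the aggregate database sees mostly in spam.
SEED_SPAM_HITS, SEED_HAM_HITS = 40_000, 400


def after_user_training(corrections, seeded):
    """Token probability after the user learns `corrections` ham messages
    that contain the token (train-on-error)."""
    if seeded:
        return spam_prob(SEED_SPAM_HITS, SEED_HAM_HITS + corrections,
                         SEED_SPAM, SEED_HAM + corrections)
    # Blank slate: assume the user had already learned 100 spam and
    # 100 ham messages, with the token seen in a single spam.
    return spam_prob(1, corrections, 100, 100 + corrections)


for n in (0, 5, 20):
    print(n, round(after_user_training(n, seeded=True), 4),
          round(after_user_training(n, seeded=False), 4))
```

With these made-up counts, twenty ham corrections barely move the seeded estimate, while the same twenty corrections flip the blank-slate estimate from spam-leaning to ham-leaning. Real behavior depends on how the actual implementation weights and combines tokens.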