On Wed, 12 Jan 2011 21:25:06 -0500 "David F. Skoll" <d...@roaringpenguin.com> wrote:
> On Wed, 12 Jan 2011 23:23:39 +0100
> mouss <mo...@ml.netoyen.net> wrote:
>
> [...]
>
> > you need to train with _your_mail. do not train with somebody else's
> > mail. one of the defence args is that attackers can't guess your
> > setup. if every one of us uses the same corpus then it'll be easy
> > for an attacker to get around.
>
> That's the conventional wisdom, but it's not true. There was a good
> paper at USENIX a few years back that talked about how (surprisingly)
> effective a shared Bayes database was. Our commercial product uses a
> daily-updated shared Bayes corpus and it's very effective.

I don't think that's really surprising. If you aggregate information
from many sites, the filter starts to pick up the same sort of
capabilities that would otherwise be provided by network tests.

I think you would probably get better results by allowing local
high-count tokens to override global token frequencies, and by
de-emphasising hammy tokens that aren't seen locally.

Is there anything to prevent spammers signing up and using your
databases to autogenerate spam? It sounds like it may be the sort of
technique that works until spammers take it seriously.

Training from slowly changing public corpora has no advantage to set
against the loss of local information, although it should be OK for
testing purposes.
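
To make the override suggestion above concrete, here is a minimal
sketch of one way it might work. It is not how any existing product or
SpamAssassin's Bayes module does it. The function name, the
(spam_count, ham_count) tuple layout, and the min_local and ham_damp
parameters are all hypothetical, and it uses raw smoothed counts rather
than per-message frequencies, which only makes sense if the ham and
spam corpora are of similar size.

```python
def token_spam_prob(local, shared, min_local=10, ham_damp=0.5):
    """Estimate P(spam | token) from local and shared counts.

    local, shared: (spam_count, ham_count) tuples for one token.
    min_local:     hypothetical threshold; at or above this many local
                   sightings, local counts override the shared ones.
    ham_damp:      hypothetical factor in [0, 1]; shrinks the hamminess
                   of tokens never seen locally (0 = ignore, 1 = keep).
    """
    ls, lh = local
    gs, gh = shared

    # Enough local evidence: trust it and ignore the shared database.
    if ls + lh >= min_local:
        return (ls + 1) / (ls + lh + 2)  # add-one smoothing

    # Otherwise pool local and shared counts.
    s, h = ls + gs, lh + gh
    p = (s + 1) / (s + h + 2)

    # A token that is hammy only in the shared data is pulled back
    # towards the neutral value 0.5, since a spammer could learn it
    # from the shared database.
    if ls + lh == 0 and p < 0.5:
        p = 0.5 - ham_damp * (0.5 - p)
    return p


if __name__ == "__main__":
    # Locally spammy token overrides a globally hammy one.
    print(token_spam_prob((20, 0), (0, 1000)))
    # Globally hammy token never seen locally is de-emphasised.
    print(token_spam_prob((0, 0), (0, 98)))
```

Spammy tokens from the shared database are left at full strength here,
on the assumption that only shared ham evidence is worth distrusting.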