On Wed, 12 Jan 2011 23:23:39 +0100 mouss <mo...@ml.netoyen.net> wrote:
[...]

> you need to train with _your_mail. do not train with somebody else's
> mail. one of the defence args is that attackers can't guess your
> setup. if every one of us uses the same corpus then it'll be easy for
> an attacker to get around.

That's the conventional wisdom, but it's not true. There was a good
paper at USENIX a few years back that talked about how (surprisingly)
effective a shared Bayes database was.

Our commercial product uses a daily-updated shared Bayes corpus and
it's very effective. The key is to have a corpus that is both large
and fresh. Ours consists of tokens from about 1.5 million messages,
all seen within the last 21 days. (We have an elaborate feedback
mechanism that collects tokens from a large number of systems and
aggregates them in a fairly privacy-preserving way.)

Regards,

David.