Re: SPAM/Phish and Ham E-mail Dataset

David F. Skoll Wed, 12 Jan 2011 18:25:40 -0800

On Wed, 12 Jan 2011 23:23:39 +0100
mouss <mo...@ml.netoyen.net> wrote:


[...]

> you need to train with _your_ mail. do not train with somebody else's
> mail. one of the defence args is that attackers can't guess your
> setup. if every one of us uses the same corpus then it'll be easy for
> an attacker to get around.

That's the conventional wisdom, but it's not true.  There was a good
paper at USENIX a few years back that talked about how (surprisingly)
effective a shared Bayes database was.  Our commercial product uses a
daily-updated shared Bayes corpus and it's very effective.

The key is to have a large corpus and a fresh corpus.  Our corpus
consists of tokens from about 1.5 million messages all seen within the
last 21 days.  (We have an elaborate feedback mechanism that collects
tokens from a large number of systems and aggregates them in a fairly
privacy-preserving way.)
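
To make the "large and fresh" idea concrete, here is a minimal sketch of a
rolling token corpus with a 21-day window.  It is only an illustration, not
how our product is actually built: the class name RollingTokenCorpus, its
methods, and the add-one smoothing are all assumptions made for the example,
and it ignores the collection and privacy side entirely.

```python
from collections import Counter, deque
from datetime import datetime, timedelta

WINDOW = timedelta(days=21)  # freshness window mentioned in the post


class RollingTokenCorpus:
    """Hypothetical rolling spam/ham token counts (illustrative only)."""

    def __init__(self, window=WINDOW):
        self.window = window
        self.batches = deque()  # (timestamp, is_spam, tokens), oldest first
        self.counts = {True: Counter(), False: Counter()}
        self.messages = {True: 0, False: 0}

    def add(self, timestamp, is_spam, tokens):
        """Record one message's distinct tokens.

        Assumes messages are added in timestamp order.
        """
        toks = frozenset(tokens)
        self.batches.append((timestamp, is_spam, toks))
        self.counts[is_spam].update(toks)
        self.messages[is_spam] += 1

    def expire(self, now):
        """Drop messages older than the window and subtract their tokens."""
        while self.batches and now - self.batches[0][0] > self.window:
            _, is_spam, toks = self.batches.popleft()
            self.counts[is_spam].subtract(toks)
            self.messages[is_spam] -= 1
        for counter in self.counts.values():
            # Remove zeroed entries so the table does not grow without bound.
            for tok in [t for t, n in counter.items() if n <= 0]:
                del counter[tok]

    def spam_probability(self, token):
        """Per-token spam estimate with add-one smoothing (an assumption)."""
        s = (self.counts[True][token] + 1) / (self.messages[True] + 2)
        h = (self.counts[False][token] + 1) / (self.messages[False] + 2)
        return s / (s + h)
```

Calling expire() once a day with the current time would keep only tokens
from messages seen inside the window, which is the freshness property
described above.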

Regards,

David.
