Re: SPAM/Phish and Ham E-mail Dataset

RW Thu, 13 Jan 2011 05:51:56 -0800

On Wed, 12 Jan 2011 21:25:06 -0500
"David F. Skoll" <d...@roaringpenguin.com> wrote:

> On Wed, 12 Jan 2011 23:23:39 +0100
> mouss <mo...@ml.netoyen.net> wrote:
> 
> [...]
> 
> > you need to train with _your_mail. do not train with somebody else's
> > mail. one of the defence args is that attackers can't guess your
> > setup. if every one of us uses the same corpus then it'll be easy
> > for an attacker to get around.
> 
> That's the conventional wisdom, but it's not true.  There was a good
> paper at USENIX a few years back that talked about how (surprisingly)
> effective a shared Bayes database was.  Our commercial product uses a
> daily-updated shared Bayes corpus and it's very effective.

I don't think that's really surprising. If you aggregate information
from many sites, the filter starts to pick up the same sort of
capabilities that would otherwise be provided by network tests. I think
you would probably get better results by allowing local high-count
tokens to override global token frequencies, and by de-emphasising
hammy tokens that aren't seen locally.
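
A minimal sketch of that kind of local/global combination, purely as
an illustration. The names (Counts, token_prob), the min_local
threshold and the ham_damping factor are all hypothetical, the
smoothing is plain add-one, and it assumes the spam and ham corpora
are roughly the same size. It is not how any particular filter or
product does it.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """How many spam and ham messages a token has been seen in."""
    spam: int = 0
    ham: int = 0

    @property
    def total(self) -> int:
        return self.spam + self.ham


def spam_prob(c: Counts) -> float:
    # Add-one smoothed spam probability; an unseen token gives 0.5.
    return (c.spam + 1) / (c.total + 2)


def token_prob(token, local, shared, min_local=20, ham_damping=0.5):
    """Per-token spam probability from a local and a shared database.

    min_local and ham_damping are illustrative values, not tuned ones.
    """
    l = local.get(token, Counts())
    g = shared.get(token, Counts())

    # A token with a high local count overrides the global frequency.
    if l.total >= min_local:
        return spam_prob(l)

    p = spam_prob(g)
    # A globally hammy token never seen locally is pulled towards the
    # neutral value 0.5, so it carries less weight.
    if p < 0.5 and l.total == 0:
        p = 0.5 - (0.5 - p) * ham_damping
    return p


local = {"invoice": Counts(spam=1, ham=39)}
shared = {
    "invoice": Counts(spam=900, ham=100),
    "newsletter": Counts(spam=10, ham=990),
}

for tok in ("invoice", "newsletter", "unknown"):
    print(tok, round(token_prob(tok, local, shared), 3))
```

Here "invoice" is spammy globally but takes its local, hammy value
because it has enough local history, while "newsletter" is hammy
globally but unseen locally, so its evidence is halved.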

Is there anything to prevent spammers signing up and using your
databases to autogenerate spam? It sounds like it may be the sort of
technique that works until spammers take it seriously.

Training from slowly changing public corpora has no advantage to set
against the loss of local information, although it should be OK for
testing purposes.
