Re: SPAM/Phish and Ham E-mail Dataset

Mahmoud Khonji Fri, 14 Jan 2011 14:30:16 -0800

On Thu, Jan 13, 2011 at 2:23 AM, mouss <mo...@ml.netoyen.net> wrote:
> sigh. if you can't understand what "privacy" means, then you are part of
> the problem.


A ham corpus "may" conflict with privacy, but it does not necessarily
have to. An example is the old ~2005 ham corpus. People can decide
which emails to share and which not to. As long as we are not
stealing emails, we are not violating anyone's privacy.

> you need to train with _your_mail. do not train with somebody else's
> mail. one of the defence args is that attackers can't guess your setup.
> if every one of us uses the same corpus then it'll be easy for an
> attacker to get around.

That "might" be acceptable if I were building a model for my own use.
However, in my case, I am evaluating a classifier that I have
developed for everyone to use (i.e. not just me), and I need to publish
its performance evaluation. Using a personal dataset might produce
misleadingly good results, either because my dataset "might" be easy
to cluster or because my approach might over-fit my dataset and email
patterns. To make sure that my classifier really works well for the
public, I need a public corpus.
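
To illustrate the concern above, here is a minimal sketch. Everything in
it is hypothetical: the toy messages, the names, and the token-count
classifier, which only stands in for a real one and is not the
classifier I am evaluating. It trains on a "personal" set and then
compares accuracy on that same set with accuracy on a separate "public"
set.

```python
from collections import Counter

def train(messages):
    """Count how often each token appears in spam and in ham."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in messages:
        counts[label].update(text.lower().split())
    return counts

def classify(counts, text):
    """Label a message by which class saw its tokens more often."""
    tokens = text.lower().split()
    spam_score = sum(counts["spam"][t] for t in tokens)
    ham_score = sum(counts["ham"][t] for t in tokens)
    # Ties (including messages with only unseen tokens) default to ham.
    return "spam" if spam_score > ham_score else "ham"

def accuracy(counts, messages):
    """Fraction of messages whose predicted label matches the true label."""
    correct = sum(classify(counts, text) == label for text, label in messages)
    return correct / len(messages)

# Hypothetical toy corpora, invented for illustration only.
personal = [
    ("meeting notes attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
    ("cheap pills online", "spam"),
    ("cheap watches online", "spam"),
]
public = [
    ("invoice for your order", "ham"),
    ("verify your account now", "spam"),
    ("cheap pills online", "spam"),
    ("see you at the conference", "ham"),
]

model = train(personal)
within = accuracy(model, personal)  # scored on the data it was trained on
across = accuracy(model, public)    # scored on mail from other people

print(f"personal corpus: {within:.2f}, public corpus: {across:.2f}")
```

The score on the personal set is the optimistic one; the score on the
public set is closer to what other users would actually see.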


**** CALL FOR COLLABORATION ****
I know many of you work in the anti-SPAM industry. If anyone is
willing to share Ham emails with me (i.e. by handing over emails that
do not conflict with their privacy, or under an NDA), I would highly
appreciate it, and will certainly acknowledge you in my publications :)
*******************************************

--
Regards,
Mahmoud Khonji
