On Thu, Jan 13, 2011 at 2:23 AM, mouss <mo...@ml.netoyen.net> wrote:
> sigh. if you can't understand what "privacy" means, then you are part of
> the problem.

A ham corpus "may" conflict with privacy, but it does not have to. One example is the old ~2005 ham corpus. People can decide which emails to share and which ones to withhold. As long as we are not stealing emails, I am not violating anyone's privacy.

> you need to train with _your_mail. do not train with somebody else's
> mail. one of the defence args is that attackers can't guess your setup.
> if every one of us uses the same corpus then it'll be easy for an
> attacker to get around.

That "might" be acceptable if I were building a model for my own use. However, in my case, I am evaluating a classifier that I have developed for everyone to use (i.e. not just me), and I need to publish its performance evaluation. Using a personal dataset might produce misleadingly good results, since my dataset "might" be easy to cluster, or my approach might over-fit my dataset and email patterns. To make sure that my classifier is really good for the public, I need a public corpus.

**** CALL FOR COLLABORATION ****

I know many of you work in the industry fighting spam. If anyone is willing to share ham emails with me (i.e. by handing over emails that do not conflict with their privacy, or via an NDA), I would highly appreciate it, and will of course acknowledge you in my publications :)

*******************************************

--
Regards,
Mahmoud Khonji