I'm trying to make sure my corpus is as clean as possible, eliminating all duplicates.
I tried to use the masses/corpora/uniq-mailbox program for this, and had problems which I've documented in Bugzilla report 2920. Fortunately, my email client can identify and delete duplicates, where duplicate = same message-id, same from, same to, and same creation time stamp. This leaves a lot of "duplicates" that uniq-mbox would have thrown away, but they were issued and received on different days, or were issued with different "from" addresses.

So, first question: if I receive an email with

    message-id = <[EMAIL PROTECTED]>
    from = The Savvy Investor <[EMAIL PROTECTED]>
    to = [EMAIL PROTECTED]
    dated Wed, 26 Nov 2003 20:53:43 01800

and a few minutes later I receive effectively the same email, with the same message-id and the same from address, but

    to = [EMAIL PROTECTED]
    dated Wed, 26 Nov 2003 21:00:19 01800

is that the same spam? Is it a duplicate?

Similarly, if on Dec 29 I receive an email with

    message-id = <[EMAIL PROTECTED]>
    from = Lily Wall <[EMAIL PROTECTED]>
    to = [EMAIL PROTECTED]
    dated Sat, 20 Dec 2003 03:22:26 +1100

and then on Dec 31 I receive an effectively similar email with the same message-id and time stamp, but

    from = Monica <[EMAIL PROTECTED]>
    to = [EMAIL PROTECTED]

and then on Jan 9 I receive an effectively similar email with the same message-id and time stamp, but

    from = Gerry R. Weeks <[EMAIL PROTECTED]>
    to = a list of seven obsolete email addresses at the same ISP
         (which changed its domain name over a year ago)

are these the same spam? Or are they duplicates?

Even more challenging: compare the following four emails:

    From: "Enormous Health Newsletter" <[EMAIL PROTECTED]>
    To: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
    Subject: Finally Something for Women balfaq.maquino!
    Date: Tue, 30 Dec 2003 16:33:24 -0800
    Message-Id: <[EMAIL PROTECTED]>

    From: "Fascinating Daily News" <[EMAIL PROTECTED]>
    To: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
    Subject: Open her floodgates of passion
    Date: Sat, 20 Dec 2003 02:50:15 -0800
    Message-Id: <[EMAIL PROTECTED]>

    From: "Valuable Daily News" <[EMAIL PROTECTED]>
    To: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
    Subject: Give her unlimited orgasms!
    Date: Fri, 26 Dec 2003 10:39:49 -0800
    Message-Id: <[EMAIL PROTECTED]>

    From: "Sensational Daily Savings" <[EMAIL PROTECTED]>
    To: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
    Subject: Act now while rates are still low!
    Date: Sun, 04 Jan 2004 09:13:23 -0800
    Message-Id: <[EMAIL PROTECTED]>

The destination email address is the same in all of them, and the message IDs are identical up to the "@". These emails even seem to cover different topics. How do we identify which of these are duplicates and which are not?

At this time I'm trusting my email client: if there is any difference in the email's header (from, to, date), I treat the message as NOT a duplicate. But I'd like to hear what other people think about these situations.

Bob Menschel
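
For anyone who wants to experiment with the strict header rule described above (same message-id, from, to, and date means duplicate), here is a minimal Python sketch. It is not how uniq-mbox or any particular email client works; the function names `header_key` and `dedupe` and the example.com addresses are made up for illustration. Passing a shorter `fields` tuple shows how a looser rule, such as message-id only, would collapse messages that the strict rule keeps apart.

```python
from email import message_from_string

STRICT_FIELDS = ("Message-ID", "From", "To", "Date")


def header_key(msg, fields=STRICT_FIELDS):
    """Build a comparison key from the chosen headers.

    Header lookup is case-insensitive; values are whitespace-normalized
    and lowercased, and a missing header counts as an empty string.
    """
    return tuple(" ".join((msg.get(name) or "").split()).lower()
                 for name in fields)


def dedupe(raw_messages, fields=STRICT_FIELDS):
    """Keep the first message seen for each distinct header key."""
    seen = set()
    unique = []
    for raw in raw_messages:
        key = header_key(message_from_string(raw), fields)
        if key not in seen:
            seen.add(key)
            unique.append(raw)
    return unique


if __name__ == "__main__":
    # Hypothetical messages: same message-id, from, and date, different "to".
    first = (
        "Message-Id: <abc123@example.com>\n"
        "From: Sender <sender@example.com>\n"
        "To: user1@example.com\n"
        "Date: Wed, 26 Nov 2003 20:53:43 -0800\n"
        "\n"
        "body\n"
    )
    second = first.replace("user1@", "user2@")

    print(len(dedupe([first, second])))                          # strict rule
    print(len(dedupe([first, second], fields=("Message-ID",))))  # looser rule
```

Under the strict rule the two sample messages are both kept, because the "to" header differs; keyed on message-id alone, the second is dropped as a duplicate. Which of those outcomes is correct is exactly the open question.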