-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Menschel writes:
> I'm trying to make sure my corpus is as clean as possible, eliminating
> all duplicates.
>
> I tried to use the masses/corpora/uniq-mailbox program for this, and had
> problems which I've documented in bugzilla report 2920.
>
> Fortunately, my email client identifies and can delete duplicates = same
> message id, same from, same to, and same creation time stamp. This leaves
> a lot of "duplicates" that uniq-mbox would have thrown away, but they
> were issued and received on different days, or were issued with different
> "from" addresses.
>
> So first question: If I receive an email,
> message-id = <[EMAIL PROTECTED]>
> from = The Savvy Investor <[EMAIL PROTECTED]>
> to = [EMAIL PROTECTED]
> dated Wed, 26 Nov 2003 20:53:43 01800
> and a few minutes later I receive effectively the same email, with the
> same message-id, and the same from address, but
> to = [EMAIL PROTECTED]
> dated Wed, 26 Nov 2003 21:00:19 01800
> is that the same spam? Is it a duplicate?

In the spam case, these *are* dups, because the message headers are
heavily randomized. The policy is generally to remove dups in the spam
corpus, since often they are only dups because (a) the spammer had to
rerun the spam-run, (b) it went to several email addresses that all wind
up in one mailbox, or (c) the spamware is broken.

However, if the duplication isn't very easily noticeable, don't worry
about it too much. I generally only remove dups from my personal mail
corpus if they are "right beside each other", i.e. noticeably sent at
the same time.

btw, uniq-mailbox is very overaggressive; it's really only useful if you
don't care about losing quite a few messages (e.g. for spamtrap cleaning).

- --j.
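For what it's worth, the policy above could be sketched roughly like this in Python. It is only an illustration under stated assumptions, not what uniq-mailbox or any SpamAssassin tool does: two messages count as dups when Message-ID and From match, To is ignored, and an optional time window (a hypothetical parameter, in seconds) limits removal to messages sent "right beside each other". The function names, the example.com addresses and the -0800 zone in the sample dates are all made up for the example.

```python
from email import message_from_string
from email.utils import mktime_tz, parsedate_tz


def dedup_key(msg):
    """Key on Message-ID and From only; To and Date are deliberately ignored."""
    msg_id = (msg.get("Message-ID") or "").strip().lower()
    sender = (msg.get("From") or "").strip().lower()
    return msg_id, sender


def sent_at(msg):
    """Return the Date header as a UTC timestamp, or None if it cannot be parsed."""
    parsed = parsedate_tz(msg.get("Date") or "")
    return mktime_tz(parsed) if parsed else None


def drop_duplicates(messages, window=None):
    """Keep the first message per key.

    window=None : any repeat of the key is a dup (spam-corpus style).
    window=N    : a repeat is a dup only if sent within N seconds of the
                  previously kept copy (personal-corpus style).
    Messages without a Message-ID are always kept.
    """
    last_kept = {}
    kept = []
    for msg in messages:
        key = dedup_key(msg)
        when = sent_at(msg)
        if key[0] and key in last_kept:
            prev = last_kept[key]
            if window is None:
                continue
            if when is not None and prev is not None and abs(when - prev) <= window:
                continue
        kept.append(msg)
        last_kept[key] = when
    return kept


# Hypothetical sample data: same Message-ID and From, different To and Date.
HEADERS = (
    "Message-ID: <spam-1@example.com>\n"
    "From: The Savvy Investor <sender@example.com>\n"
    "To: {to}\n"
    "Date: {date}\n"
    "\n"
    "body\n"
)
first = message_from_string(
    HEADERS.format(to="a@example.com", date="Wed, 26 Nov 2003 20:53:43 -0800"))
second = message_from_string(
    HEADERS.format(to="b@example.com", date="Wed, 26 Nov 2003 21:00:19 -0800"))

print(len(drop_duplicates([first, second])))             # 1
print(len(drop_duplicates([first, second], window=60)))  # 2
```

To run it over a real corpus, the messages could come from the standard library's mailbox.mbox instead of the sample strings. Keying only on headers is far less aggressive than comparing bodies, so it will miss spam that was resent with a fresh Message-ID.
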
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFAAwl9QTcbUG5Y7woRApTvAJ4n9rGRtQOYqbeUi/BbxdxIHL4qjgCfWPZN
GJcjQpAvZ+ePnv8+KAQ2NA4=
=PSHo
-----END PGP SIGNATURE-----

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk