-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Menschel writes:
> I'm trying to make sure my corpus is as clean as possible, eliminating
> all duplicates.
> 
> I tried to use the masses/corpora/uniq-mailbox program for this, and had
> problems which I've documented in bugzilla report 2920.
> 
> Fortunately, my email client identifies and can delete duplicates = same
> message id, same from, same to, and same creation time stamp. This leaves
> a lot of "duplicates" that uniq-mbox would have thrown away, but they
> were issued and received on different days, or were issued with different
> "from" addresses.
> 
> So first question: If I receive an email,
> message-id = <[EMAIL PROTECTED]>
> from = The Savvy Investor <[EMAIL PROTECTED]>
> to = [EMAIL PROTECTED]
> dated Wed, 26 Nov 2003 20:53:43 01800
> and a few minutes later I receive effectively the same email, with the
> same message-id, and the same from address, but 
> to = [EMAIL PROTECTED]
> dated Wed, 26 Nov 2003 21:00:19 01800
> is that the same spam? Is it a duplicate?

In the spam case, these *are* dups: spam headers are heavily randomized,
so a different "to" or date does not make it a different message.

The policy is generally to remove dups in the spam corpus -- since often
they are only dups because (a) the spammer had to rerun the spam-run, (b)
it went to several email addresses that all wind up in one mailbox, (c)
broken spamware.

However if the duplication isn't very easily noticeable, don't worry
about it too much -- I generally only remove dups from my personal
mail corpus if they are "right beside each other", i.e. noticeably
sent at the same time.
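
As a rough illustration of that "right beside each other" rule, here is a
minimal Python sketch. It is not what uniq-mailbox does. It assumes two
messages are duplicates when they share a Message-ID and From header and
their Date headers fall within a window. The function names, the one-hour
default, and the example.com addresses are all made up for illustration.

```python
from email import message_from_string
from email.utils import parsedate_tz, mktime_tz


def dedup_key(msg):
    """Key on Message-ID and From only; To and Date vary between copies."""
    return ((msg.get("Message-ID") or "").strip().lower(),
            (msg.get("From") or "").strip().lower())


def sent_time(msg):
    """Return the Date header as a UTC timestamp, or None if unusable."""
    raw = (msg.get("Date") or "").strip()
    if not raw:
        return None
    try:
        parsed = parsedate_tz(raw)
        return mktime_tz(parsed) if parsed else None
    except (IndexError, TypeError, ValueError, OverflowError):
        return None


def drop_adjacent_dups(messages, window_seconds=3600):
    """Keep the first copy; drop same-key copies dated within the window.

    window_seconds is an assumed threshold, not a SpamAssassin setting.
    Messages with no Message-ID or no parseable Date are always kept.
    """
    last_seen = {}  # key -> timestamp of the most recent copy seen
    kept = []
    for msg in messages:
        key = dedup_key(msg)
        when = sent_time(msg)
        if not key[0] or when is None:
            kept.append(msg)  # cannot judge, so err on the side of keeping
            continue
        prev = last_seen.get(key)
        last_seen[key] = when
        if prev is not None and abs(when - prev) <= window_seconds:
            continue  # same key, close in time: treat as a dup
        kept.append(msg)
    return kept


if __name__ == "__main__":
    def make(to, date):
        # Hypothetical messages sharing one Message-ID and From address.
        return message_from_string(
            "Message-ID: <run1@example.com>\n"
            "From: Sender <sender@example.com>\n"
            f"To: {to}\nDate: {date}\n\nbody\n")

    msgs = [make("a@example.com", "Wed, 26 Nov 2003 20:53:43 -0800"),
            make("b@example.com", "Wed, 26 Nov 2003 21:00:19 -0800"),
            make("c@example.com", "Sat, 29 Nov 2003 21:00:19 -0800")]
    for m in drop_adjacent_dups(msgs):
        print(m["To"])
```

To run it over a real mbox, the standard library's mailbox.mbox(path) can
be iterated and passed in as the messages argument. Copies received days
apart are deliberately kept, which is more conservative than uniq-mailbox.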

btw uniq-mailbox is very overaggressive; it's really only useful if
you don't care about losing quite a few messages (e.g. for spamtrap
cleaning).

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFAAwl9QTcbUG5Y7woRApTvAJ4n9rGRtQOYqbeUi/BbxdxIHL4qjgCfWPZN
GJcjQpAvZ+ePnv8+KAQ2NA4=
=PSHo
-----END PGP SIGNATURE-----

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk