Robert Menschel wrote:
> I'm trying to make sure my corpus is as clean as possible, eliminating
> all duplicates.
> 
> I tried to use the masses/corpora/uniq-mailbox program for this, and had
> problems which I've documented in bugzilla report 2920.
> 
> Fortunately, my email client identifies and can delete duplicates = same
> message id, same from, same to, and same creation time stamp. This leaves
> a lot of "duplicates" that uniq-mbox would have thrown away, but they
> were issued and received on different days, or were issued with different
> "from" addresses.

[examples deleted]

> The destination email address is all the same, and the message ID is
> identical up to the "@". These emails seem to even cover different
> topics. How do we identify which of these are duplicates and which are
> not?
> 
> At this time I'm trusting my email client, and if there is any difference
> in the email's header (from, to, date), I treat it as NOT a
> duplicate. But I'd like to hear what other people think about these
> situations.

My first inclination is to pass along what was recommended to me in one
of the SA groups a while ago: generate your own checksums for all
incoming mail.

There's a script in the procmail routines library at:

http://pm-lib.sourceforge.net/pm-lib.html

Search the page for 'pm-jadup.rc'.

This won't help with your already received corpus, but you could use a
similar approach.
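
For the mail you already have, something along these lines might work.
This is only a rough sketch of the checksum idea, not the logic of
pm-jadup.rc itself: it hashes just the message body (with whitespace
collapsed), so copies that differ only in From, To, Date, or Message-ID
headers get the same checksum. The function names body_checksum and
group_duplicates are my own, and the choice to ignore every header,
Subject included, is an assumption you may want to change.

```python
import email
import hashlib
from collections import defaultdict


def body_checksum(raw_message: str) -> str:
    """Return a SHA-256 hex digest of the message body only.

    Headers are ignored on purpose (an assumption of this sketch), so
    copies sent on different days or from different addresses collide.
    """
    msg = email.message_from_string(raw_message)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        payload = part.get_payload(decode=True) or b""
        # Collapse whitespace so CRLF/LF or re-wrapped copies still match.
        parts.append(b" ".join(payload.split()))
    return hashlib.sha256(b"\n".join(parts)).hexdigest()


def group_duplicates(raw_messages):
    """Return lists of indexes whose messages share a body checksum."""
    groups = defaultdict(list)
    for index, raw in enumerate(raw_messages):
        groups[body_checksum(raw)].append(index)
    return [indexes for indexes in groups.values() if len(indexes) > 1]
```

If the corpus is in mbox format, Python's standard mailbox module can
supply the messages, e.g. by passing msg.as_string() for each msg in
mailbox.mbox(path). Messages with similar Message-IDs but different
topics would end up with different checksums, since their bodies differ.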

Bryan

> Bob Menschel
> 
> -------------------------------------------------------
> This SF.net email is sponsored by: Perforce Software.
> Perforce is the Fast Software Configuration Management System offering
> advanced branching capabilities and atomic changes on 50+ platforms.
> Free Eval! http://www.perforce.com/perforce/loadprog.html

-- 
That's why my soul always reverts to the Old Testament and to
Shakespeare.  There at least one feels that it's human beings talking. 
There people hate, people love, people murder their enemy and curse his
descendants through all generations, there people sin. - (Soren
Kierkegaard - Either/Or)

http://www.wecs.com/content.htm

This signature file is generated by Pick-a-Tag !
Written by Jeroen van Vaarsel
http://www.google.com/search?hl=en&ie=ISO-8859-1&q=pick-a-tag



_______________________________________________
Spamassassin-talk mailing list
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk