Robert Menschel wrote:
> I'm trying to make sure my corpus is as clean as possible, eliminating
> all duplicates.
>
> I tried to use the masses/corpora/uniq-mailbox program for this, and had
> problems which I've documented in bugzilla report 2920.
>
> Fortunately, my email client identifies and can delete duplicates = same
> message id, same from, same to, and same creation time stamp. This leaves
> a lot of "duplicates" that uniq-mbox would have thrown away, but they
> were issued and received on different days, or were issued with different
> "from" addresses.

[examples deleted]

> The destination email address is all the same, and the message ID is
> identical up to the "@". These emails seem to even cover different
> topics. How do we identify which of these are duplicates and which are
> not?
>
> At this time I'm trusting my email client, and if there is any difference
> in the email's header (from, to, date), I'm treating this as NOT a
> duplicate. But I'd like to hear what other people think about these
> situations.

My first inclination is to suggest what was recommended to me in one of
the SA groups a while ago: generate your own checksums for all incoming
mail. There's a script for this in the procmail routines library at:

http://pm-lib.sourceforge.net/pm-lib.html

Search the page for 'pm-jadup.rc'.

This won't help with your already received corpus, but you could use a
similar approach.

Bryan

> Bob Menschel
>
> -------------------------------------------------------
> This SF.net email is sponsored by: Perforce Software.
> Perforce is the Fast Software Configuration Management System offering
> advanced branching capabilities and atomic changes on 50+ platforms.
> Free Eval! http://www.perforce.com/perforce/loadprog.html

-- 
That's why my soul always reverts to the Old Testament and to
Shakespeare. There at least one feels that it's human beings talking.
There people hate, people love, people murder their enemy and curse his
descendants through all generations, there people sin.
- (Soren Kierkegaard - Either/Or)
http://www.wecs.com/content.htm

This signature file is generated by Pick-a-Tag !
Written by Jeroen van Vaarsel
http://www.google.com/search?hl=en&ie=ISO-8859-1&q=pick-a-tag

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
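
For an already collected corpus, the "similar approach" mentioned in the reply might look roughly like the sketch below. This is not pm-jadup.rc and makes no claim about how that script works; it is a minimal illustration that assumes a duplicate means "same body text after collapsing whitespace", ignoring From, To, Date and Message-ID. The function names body_checksum and find_duplicates are made up for this example.

```python
import email
import hashlib


def body_checksum(msg):
    """Return a SHA-256 hex digest of the message body, headers ignored."""
    digest = hashlib.sha256()
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        payload = part.get_payload(decode=True) or b""
        # Collapse whitespace runs so re-wrapped copies hash alike
        # (assumption: line wrapping is not a meaningful difference).
        digest.update(b" ".join(payload.split()))
        digest.update(b"\0")  # separator between parts
    return digest.hexdigest()


def find_duplicates(messages):
    """Return (duplicate_index, first_seen_index) pairs, in input order."""
    seen = {}
    duplicates = []
    for index, msg in enumerate(messages):
        key = body_checksum(msg)
        if key in seen:
            duplicates.append((index, seen[key]))
        else:
            seen[key] = index
    return duplicates
```

With the standard library's mailbox module, something like find_duplicates(mailbox.mbox("corpus.mbox")) would apply this to an mbox file (the path is a placeholder). Because headers are ignored, two messages sent on different days or from different addresses count as duplicates when their bodies match, so the examples described above that cover different topics would be kept as distinct. Whether that is the right policy for a given corpus is a judgment call, and spam that varies a few random tokens per copy would not be caught by an exact checksum like this.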