As an aside,
formail -D 2 /tmp/dup_id_cache.$$ -s < mbox.txt > mbox_no_dupes.txt
rm -f /tmp/dup_id_cache.$$
will do a decent job of weeding out duplicates (based on the Message-ID
header), where 2 is the maximum size of the id cache. formail takes that
size in bytes, so a real mailbox needs a much larger value.
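
For anyone without procmail/formail installed, the same idea (keep the first message seen for each Message-ID, drop later ones) can be sketched in Python with the standard mailbox module. This is only a rough illustration, not what formail does internally: the function name dedupe_mbox is made up for this example, it keeps every id in memory instead of using a size-limited cache file, and it assumes that messages with no Message-ID header should always be kept.

```python
import mailbox


def dedupe_mbox(src_path, dst_path):
    """Copy src_path to dst_path, keeping the first message per Message-ID.

    Hypothetical helper for illustration. Returns (kept, dropped).
    """
    seen = set()  # every Message-ID seen so far (unbounded, unlike formail's cache)
    kept = dropped = 0
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)
    try:
        for msg in src:
            msg_id = (msg.get("Message-ID") or "").strip()
            if msg_id and msg_id in seen:
                dropped += 1  # same id seen earlier: skip this copy
                continue
            if msg_id:
                seen.add(msg_id)
            # Assumption: a message without a Message-ID is never a duplicate.
            dst.add(msg)
            kept += 1
        dst.flush()
    finally:
        dst.close()
        src.close()
    return kept, dropped
```

Calling dedupe_mbox("mbox.txt", "mbox_no_dupes.txt") would mirror the file names used in the formail command above.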
---
Robert Menschel writes:
> I'm trying to make sure my corpus is as clean as possible, eliminating
> all duplicates.
>
> I tried to use the masses/corpora/uniq-mailbox program for this, and had
> problems which I've documented in bugzilla report 2920.
I'm trying to make sure my corpus is as clean as possible, eliminating
all duplicates.
I tried to use the masses/corpora/uniq-mailbox program for this, and had
problems which I've documented in bugzilla report 2920.
Fortunately, my email client identifies and can delete duplicates = same
message