> From: Eric Gerlach [mailto:egerl...@feds.uwaterloo.ca]
> Sent: Friday, February 27, 2009 11:03 AM
> Subject: Re: finding similar files
>
> On Wed, Feb 25, 2009 at 06:58:48PM +0000, Hendrik Boom wrote:
> > There wouldn't happen to be any handy tools for searching a directory
> > tree with a few hundred ASCII files and telling me which ones have
> > similar content?
> >
> > Many have been copied, edited, merged, reformatted, split, and I'd like
> > to find the differences, decide on what to keep, and delete redundant
> > ones.
> >
> > I know there's such a program for image files.
> >
> > I know about wdiff, which would be fine after I've paired off the similar
> > files (or fragments of files). to resolve differences that remain.
>
> You could write a script that would brute force all possible pairs of files
> (yes, I know that's big, but it's only 125 000 for 500 files), run them
> through "wdiff -s", and then set some threshold for similarity on the
> statistics. Then, you get a list of potential matches.
>
> The only trick is setting the threshold... and I have no idea how to help
> you there.
>
> And if you're looking for fragments of files, that's a whole different
> ballgame.
>
> Cheers,
>
> --
> Eric Gerlach, Network Administrator
> Federation of Students
> University of Waterloo
> p: (519) 888-4567 x36329
> e: egerl...@feds.uwaterloo.ca

I would probably do something similar to what Eric mentioned, but I would weed out the exact duplicates first. Try using fdupes. I tend to use:

`fdupes /your/dir/ -rS`

(-r recurses into subdirectories, -S shows the size of the duplicate files.) Add -d to delete as you go, but I highly encourage you to read the man page first and test it on something you don't care about, so you know how it works.

Hope this helps!
~Stack~
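
P.S. In case it helps, here is a rough sketch of the brute-force pairing Eric described. It is not his script and it does not call wdiff: I have swapped in Python's standard difflib.SequenceMatcher to get a word-level similarity ratio, which is only a stand-in for the "wdiff -s" statistics. The names `words` and `similar_pairs` and the default threshold of 0.8 are my own assumptions, so expect to tune the threshold by hand.

```python
import difflib
import itertools
from pathlib import Path


def words(path):
    """Return the file's content as a list of whitespace-separated words."""
    # errors="replace" keeps the scan going if a file is not clean text.
    return Path(path).read_text(errors="replace").split()


def similar_pairs(root, threshold=0.8):
    """Yield (ratio, path_a, path_b) for each pair of files under root
    whose word-level similarity ratio is at or above threshold.

    The threshold default is an arbitrary starting point, not a
    recommended value.
    """
    files = sorted(p for p in Path(root).rglob("*") if p.is_file())
    # Tokenize every file once instead of once per pair.
    tokens = {p: words(p) for p in files}
    for a, b in itertools.combinations(files, 2):
        matcher = difflib.SequenceMatcher(None, tokens[a], tokens[b],
                                          autojunk=False)
        # quick_ratio() is a cheap upper bound on ratio(), so pairs that
        # cannot reach the threshold are skipped without the full compare.
        if matcher.quick_ratio() < threshold:
            continue
        ratio = matcher.ratio()
        if ratio >= threshold:
            yield ratio, a, b
```

To try it, something like `for r, a, b in sorted(similar_pairs("/your/dir/"), reverse=True): print(f"{r:.2f} {a} {b}")` lists the closest pairs first. Comparing on words rather than lines means reformatted or re-wrapped copies still score high, which is the same reason wdiff was suggested. It will not find fragments of one file inside another; as Eric said, that is a different problem.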