I keep reading this thread when I should be going to bed :) Sadly, this stuff is in what I call my "0.2% time" so I'm not working very hard on it right now.
The Ruby code, which basically works (but is rather ugly in parts), relies on reading the whole file tree into memory and then traversing the tree, saving each node (file or dir) in a repository, which internally detects duplicates as they are added. The repository stores all known nodes, indexed by size and grouped into clumps of identical nodes. When you add a new node, it is compared to all clumps of the same size as the node, looking for an identical clump; if one is found, the new node is added to that clump, otherwise it forms a new clump. (Sorry if this is a bit of a vague description - I haven't worked on this code for a while, so all the details are a bit vague.)

Once the repository is built, it's pretty easy to throw away all non-duplicate nodes and report on the duplicates. This works, but has some issues:
- it has some kind-of ugly handling for special cases, like making sure a directory doesn't match its own children if it only has one child
- it's a bit slow, and uses a lot of memory
- it *only* handles exact matches.

I've been playing with the shingling and sketching algorithms that search engines use to identify nearly-identical documents, and I think they could be applied to this problem; in fact, I suspect they could speed it up considerably. (If you want to know more about this, the best reference online seems to be the book "Introduction to Information Retrieval", at http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html - the most relevant chapter is at http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html )

But, like I said, this is my 0.2% time project, so it might be some time before I really do more than think about this. :)

- Korny

On Fri, May 29, 2009 at 4:51 PM, Daniel Lyons <fus...@storytotell.org> wrote:
>
> For whatever reason I just can't seem to put this problem down.
>
> I have rewritten the code substantially. A major bottleneck was using
> Java's MD5 classes.
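A rough Ruby sketch of the repository-of-clumps structure described above (`Node` and its `fingerprint` field are hypothetical stand-ins for however node identity is actually computed; indexing by size first means only same-size nodes ever get compared, which also keeps any hashing cost down):

```ruby
# Sketch of the "repository of clumps" idea: nodes (files or dirs) are
# indexed by size; each size bucket holds clumps - groups of nodes
# already known to be identical.
Node = Struct.new(:path, :size, :fingerprint)

class Repository
  def initialize
    # size => array of clumps; each clump is an array of identical nodes
    @by_size = Hash.new { |h, k| h[k] = [] }
  end

  def add(node)
    clumps = @by_size[node.size]
    clump = clumps.find { |c| c.first.fingerprint == node.fingerprint }
    if clump
      clump << node     # identical to an existing clump: join it
    else
      clumps << [node]  # no match among same-size clumps: start a new one
    end
  end

  # Throw away non-duplicates: only clumps with 2+ members are reported.
  def duplicates
    @by_size.values.flatten(1).select { |clump| clump.size > 1 }
  end
end

repo = Repository.new
repo.add(Node.new("/a/x.txt", 10, "f1"))
repo.add(Node.new("/b/x.txt", 10, "f1"))
repo.add(Node.new("/c/y.txt", 10, "f2"))
repo.duplicates.first.map(&:path)  # => ["/a/x.txt", "/b/x.txt"]
```

The directory-vs-own-child special case mentioned above would need extra handling on top of this, since a single-child directory can end up with the same fingerprint as its child.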
> The "Fast MD5" library really is fast, and that helped a lot. I did
> get the -> notation to work, and I have a reasonable HOF now for doing
> the winnowing, which might even be applicable to another program
> someday, maybe.
>
> Anyway, I uploaded it here:
> <http://clojure.googlegroups.com/web/dupfinder.clj>
> and again I'd love any feedback anyone cares to give.
>
> Just to add insult to injury, I went ahead and programmed it again in
> Ruby. The good news is that I can't seem to get Ruby to find all the
> files the Clojure one finds, but the bad news is that the Ruby version
> is something like four times faster. I'd love to understand that. So I
> uploaded that too: <http://clojure.googlegroups.com/web/dupfinder.rb>.
> Of course it must be benefiting to some extent from the fact that the
> Ruby version has a lot less abstraction, but I don't see how the
> approach is fundamentally any different, or why there would be such a
> large performance disparity. I must be missing something big.
>
> On May 28, 2009, at 6:50 AM, Korny Sietsma wrote:
>> By the way, in response to whoever suggested pre-sorting files: I
>> sort-of do this (in the old Ruby version), but actually, mostly the
>> program is looking for duplicate *directories* of files - the goal is
>> to point it at my archive disk and have it find the biggest identical
>> subdirectories. Duplicate file checking is needed for this, but it's
>> only a tiny part.
>>
>> And I'm playing with sketching algorithms at work right now, which
>> look very handy for the next phase, which is to find the biggest
>> *similar* subdirectories. That's the real goal - point a program at a
>> terabyte archive disk, and have it spit out:
>> "/archive/old_disks/laptop_2007a is 312gb and 99% similar to
>> /archive/misc/stuff_from_2007"
>> ...
>> or, sorting by file count:
>> "/archive/source/old_projects/c_stuff/1996 is 20,324 files and 97% similar to
>> /archive/old/disks/laptop2006/unsorted/old_drives/old_archive/c_cpp_stuff/90s"
>
> I can think of three ways to approach this, none of which are
> particularly easy.
>
> The first is to take the duplicate-file-finding function and look for
> common suffixes of paths. It could almost be like running a fuzzy
> duplicate finder against your duplicates. I suspect the performance
> here will blow, for no particular reason I can put my finger on.
>
> The second approach would be to use something like rsync's rolling
> checksum. I bet you could use a rather stupid heuristic to find
> candidate directories for similarity and then apply this algorithm
> amongst the candidates. It would be handy if you could have a version
> of diff that would look at two directories recursively and give up if
> they were sufficiently different.
>
> Another option would be to ignore the filesystem altogether and look
> for runs of similar blocks on the device directly. For some reason I
> think this will perform better, but it'll probably be harder to reason
> about during development, and I doubt it will be a cakewalk from Java.
>
> In any event, that's a very interesting problem, but for some reason
> it's setting off my alarm bells. I think that if it's going to be
> useful, it'll probably wind up being exponential time, and that the
> compromises needed to make it run acceptably will probably wind up
> diminishing its value. I have no idea why I think these things,
> though... maybe I'm just up too late.
>
> Another point: most OSes and filesystems now have a change-notification
> API (or an emulation of one) which you could use to maintain indices,
> if you want to do something more than once. This probably isn't
> relevant to this problem, though. It's on my mind because I was just
> recently hacking on something for a friend that uses Mac OS X's API to
> do that.
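As a toy illustration of the shingling idea for the *similar*-directories goal: represent each directory as the set of (relative path, content hash) pairs of the files under it, and score directory pairs with the Jaccard coefficient (intersection size over union size). This is only a sketch with made-up function names - a real sketching approach would compress these sets (e.g. with min-hashing) rather than compare them exactly:

```ruby
require 'digest/md5'
require 'set'

# Signature of a directory: the set of [relative path, MD5 of contents]
# pairs for every file under it. Two identical trees get equal sets;
# mostly-identical trees get mostly-overlapping sets.
def dir_signature(root)
  sig = Set.new
  Dir.glob(File.join(root, "**", "*")).each do |path|
    next unless File.file?(path)
    rel = path.sub(%r{\A#{Regexp.escape(root)}/?}, "")
    sig << [rel, Digest::MD5.file(path).hexdigest]
  end
  sig
end

# Jaccard coefficient: 1.0 for identical sets, 0.0 for disjoint ones.
def jaccard(a, b)
  return 1.0 if a.empty? && b.empty?
  (a & b).size.to_f / (a | b).size
end
```

With something like this, a report such as "X is 99% similar to Y" amounts to printing directory pairs where `jaccard(dir_signature(x), dir_signature(y)) >= 0.99` - though comparing full signatures pairwise is exactly the cost that min-hash sketches are meant to avoid.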
>
> Hope any of this helps, and looking forward to (all) your feedback :)
>
> —
> Daniel Lyons
> http://www.storytotell.org -- Tell It!

--
Kornelis Sietsma
korny at my surname dot com
"Every jumbled pile of person has a thinking part
that wonders what the part that isn't thinking
isn't thinking of"

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en
-~----------~----~----~----~------~----~------~--~---