For whatever reason I just can't seem to put this problem down. I have rewritten the code substantially. A major bottleneck was using Java's MD5 classes. The "Fast MD5" library really is, and that helped a lot. I did get the -> notation to work and I have a reasonable HOF now for doing the winnowing, which might even be applicable to another program someday, maybe.
Anyway I uploaded it here: <http://clojure.googlegroups.com/web/dupfinder.clj > and again I'd love any feedback anyone cares to give. Just to add insult to injury, I went ahead and programmed it again in Ruby. The good news is that I can't seem to get Ruby to find all the files the Clojure one finds, but the bad news is that the Ruby version is like four times faster. I'd love to understand that. So I uploaded that too: <http://clojure.googlegroups.com/web/dupfinder.rb>. Of course it must be benefiting to some extent from the fact that the Ruby version has a lot less abstraction, but I don't see how the approach is fundamentally any different or why there would be such a large performance disparity. I must be missing something big. On May 28, 2009, at 6:50 AM, Korny Sietsma wrote: > By the way, in response to whoever suggested pre-sorting files; I > sort-of do this (in the old ruby version) but actually, mostly the > program is looking for duplicate *directories* of files - the goal is > to point it at my archive disk, and have it find the biggest identical > subdirectories. Duplicate file checking is needed for this, but it's > only a tiny part. > > And I'm playing with sketching algorithms at work right now, which > look very handy for the next phase, which is to find the biggest > *similar* subdirectories. That's the real goal - point a program at a > terabyte archive disk, and have it spit out : > "/archive/old_disks/laptop_2007a is 312gb and 99% similar to > /archive/misc/stuff_from_2007" > ... or sorting by file count: > "/archive/source/old_projects/c_stuff/1996 is 20,324 files and 97% > similar to /archive/old/disks/laptop2006/unsorted/old_drives/ > old_archive/c_cpp_stuff/90s" I can think of three ways to approach this, none of which are particularly easy. The first is to take the duplicate file finding function and look for common suffixes of paths. It could almost be like running a fuzzy duplicate finder against your duplicates. I suspect the performance here will blow for no particular reason I can put my finger on. The second approach would be to use something like rsync's rolling checksum. I bet you could use a rather stupid heuristic to find candidate directories for similarity and then apply this algorithm amongst the candidates. It would be handy if you could have a version of diff that would look at two directories recursively and give up if they were sufficiently different. Another option would be to ignore the filesystem altogether and look for runs of similar blocks on the device directly. For some reason I think this will perform better but it'll probably be harder to reason about during development and I doubt it will be a cakewalk from Java. In any event, that's a very interesting problem, but for some reason it's setting off my alarm bells. I think if it's going to be useful it'll probably wind up being exponential time, and that compromises to make it run acceptably will probably wind up diminishing the value. I have no idea why I think these things though... maybe I'm just up too late. Another point: most OSes and filesystems have a change notification API now (or an emulation of one) which you could use to maintain indices, if you want to do something more than once. This probably isn't relevant to this problem though. It's on my mind because I was just recently hacking on something for a friend that uses Mac OS X's API to do that. Hope any of this helps and looking forward to (all) your feedback, :) — Daniel Lyons http://www.storytotell.org -- Tell It! --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en -~----------~----~----~----~------~----~------~--~---