For whatever reason I just can't seem to put this problem down.

I have rewritten the code substantially. A major bottleneck was using  
Java's MD5 classes. The "Fast MD5" library really is, and that helped  
a lot. I did get the -> notation to work and I have a reasonable HOF  
now for doing the winnowing, which might even be applicable to another  
program someday, maybe.

Anyway I uploaded it here: <http://clojure.googlegroups.com/web/dupfinder.clj 
 > and again I'd love any feedback anyone cares to give.

Just to add insult to injury, I went ahead and programmed it again in  
Ruby. The good news is that I can't seem to get Ruby to find all the  
files the Clojure one finds, but the bad news is that the Ruby version  
is like four times faster. I'd love to understand that. So I uploaded  
that too: <http://clojure.googlegroups.com/web/dupfinder.rb>. Of  
course it must be benefiting to some extent from the fact that the  
Ruby version has a lot less abstraction, but I don't see how the  
approach is fundamentally any different or why there would be such a  
large performance disparity. I must be missing something big.

On May 28, 2009, at 6:50 AM, Korny Sietsma wrote:
> By the way, in response to whoever suggested pre-sorting files; I
> sort-of do this (in the old ruby version) but actually, mostly the
> program is looking for duplicate *directories* of files - the goal is
> to point it at my archive disk, and have it find the biggest identical
> subdirectories.  Duplicate file checking is needed for this, but it's
> only a tiny part.
>
> And I'm playing with sketching algorithms at work right now, which
> look very handy for the next phase, which is to find the biggest
> *similar* subdirectories.  That's the real goal - point a program at a
> terabyte archive disk, and have it spit out :
> "/archive/old_disks/laptop_2007a is 312gb and 99% similar to
> /archive/misc/stuff_from_2007"
> ... or sorting by file count:
> "/archive/source/old_projects/c_stuff/1996 is 20,324 files and 97%
> similar to /archive/old/disks/laptop2006/unsorted/old_drives/ 
> old_archive/c_cpp_stuff/90s"


I can think of three ways to approach this, none of which are  
particularly easy.

The first is to take the duplicate file finding function and look for  
common suffixes of paths. It could almost be like running a fuzzy  
duplicate finder against your duplicates. I suspect the performance  
here will blow for no particular reason I can put my finger on.

The second approach would be to use something like rsync's rolling  
checksum. I bet you could use a rather stupid heuristic to find  
candidate directories for similarity and then apply this algorithm  
amongst the candidates. It would be handy if you could have a version  
of diff that would look at two directories recursively and give up if  
they were sufficiently different.

Another option would be to ignore the filesystem altogether and look  
for runs of similar blocks on the device directly. For some reason I  
think this will perform better but it'll probably be harder to reason  
about during development and I doubt it will be a cakewalk from Java.

In any event, that's a very interesting problem, but for some reason  
it's setting off my alarm bells. I think if it's going to be useful  
it'll probably wind up being exponential time, and that compromises to  
make it run acceptably will probably wind up diminishing the value. I  
have no idea why I think these things though... maybe I'm just up too  
late.

Another point: most OSes and filesystems have a change notification  
API now (or an emulation of one) which you could use to maintain  
indices, if you want to do something more than once. This probably  
isn't relevant to this problem though. It's on my mind because I was  
just recently hacking on something for a friend that uses Mac OS X's  
API to do that.

Hope any of this helps and looking forward to (all) your feedback, :)

—
Daniel Lyons
http://www.storytotell.org -- Tell It!


--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To post to this group, send email to clojure@googlegroups.com
To unsubscribe from this group, send email to 
clojure+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/clojure?hl=en
-~----------~----~----~----~------~----~------~--~---

Reply via email to