On May 27, 2009, at 7:52 PM, Korny Sietsma wrote: > How would I do this in a functional way? My first effort would be > something like > (defn hash [filename] (memoize (... hash function ...))) > but I have a couple of problems with this: > - it doesn't seem to store the hash value with the rest of the file > information, which feels a bit ugly > - I assume it means storing the full filename three times, once in > the original file info structure, once in the memoized hash function, > and once in the memoized quickhash function. My program struggles to > get enough RAM to track as many files as I'd like already - storing > the filename multiple times would blow out memory quite badly. > > I guess I could define a unique key for each filename, and define hash > as a function on that key, but then hash would need to be able to > access the list of filenames somehow. It's starting to get beyond me > - I'm hoping there's a simpler option!
OK, I'm probably talking out of turn here, but I thought I would share some of my thoughts. First of all, though you probably already thought of this, you can improve time complexity from O(N^2) to O(N log N) if you sort the files looking for duplicates rather than comparing every file to every other file. In other words, make a list of files and their hashes, and sort by the hashes, then walk the list comparing each hash to the previous hash. I can think of two ways to do this. The first one I thought of might be more functional, which would be to successively narrow down a list of potential duplicates comparison by comparison. It would have a structure something like this: generate a list of files by size; group by files of the same size; remove all sublists of length 1 (unique sizes). Pass that list to a function which generates a list of files by the quick hash; group by them by it; remove all sublists of length 1. Pass that to a function which does the same thing only with the more expensive hash. I suspect you could make each of the three steps a function and then make a fourth function which combines those three and takes a function as a parameter to produce the hash. Then you could use the comp function to combine those three functions and presto, you have your list of duplicates. Another possible way that more closely resembles what you're doing right now would be to have a function which looks at a file and lazily produces the three comparison values, then do the sort/group procedure with a comparison function that is smart enough to only compare as much as it has to. I think you could build this with what Mikio was proposing, because map is lazy. As another example of this: (defn get-comparators [str] (cons (nth str 0) (lazy-seq (do (println "Looking at second element") (Thread/sleep 1000) (list (nth str 1)))))) This function returns a list of two values which are just the first two characters of a string, but it prints a message and waits a second if the second value in the list is actually forced. If you never look at the second value, you never pay the penalty for computing it. So supposing a compare a list of strings which differ on the first character, it should return immediately without printing anything, and it does: user> (sort-by get-comparators #(some (complement zero?) (map compare %1 %2)) ["ab" "cd" "ef" "gh"]) ("ab" "cd" "ef" "gh") But if some of them have the same first character, it should have a delay and then return the values: user> (sort-by get-comparators #(some (complement zero?) (map compare %1 %2)) ["ab" "cd" "ef" "eg"]) Looking at second element Looking at second element ("ab" "cd" "ef" "eg") So you can see that even though it had to use the second values to compare 'ef' and 'eg', it only had to "compute" their second characters, not the second characters of any of the other elements. There is a group-by function in clojure.contrib.seq-utils which would be of use. Unfortunately I don't see how to do filter with a hash-map to produce another hash-map. The best I came up with was this: (def groupings (clojure.contrib.seq-utils/group-by #(first (get-comparators %)) ["abc" "abd" "hello" "gravy" "stuff" "stripes"])) (apply hash-map (apply concat (filter (fn [[key value]] (> (count value) 1)) groupings))) I don't trust apply, and double apply seems really fishy. I'm probably missing something obvious there. Also, it would probably be better to either be able to furnish group-by with a comparison function a la sort-by or else somehow make lists be instances of comparable to make this work using get-comparators. One more note, because you mentioned some concern over having filenames in memory. I'm not sure but I don't think Java and therefore Clojure can do this, but in general (at least on Unix-based systems) you ought to be able to store the inode and look up everything else at considerable savings. Your storage costs then ought to be something like a 4-8 bytes per file. Since you don't care about the filename for identifying duplicates, just displaying them, that could save you some overhead during the comparison stage, but I'd only bother with it if I knew I had to compare a *lot* of files and I only cared about the result during one run, as inodes are frequently recycled. None of this really answers your question about internally mutable state. Of course internally mutable state is possible, either using closed-over variables or the OOP facilities of Clojure. I have a recent message which gives examples of both, which you can see here: <http://groups.google.com/group/clojure/msg/de5246273c08571f >. Hope this helps, — Daniel Lyons http://www.storytotell.org -- Tell It! --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en -~----------~----~----~----~------~----~------~--~---