On Feb 2, 7:53 pm, Wardrop <t...@tomwardrop.com> wrote:
> I feel like I'm over-staying my welcome by posting yet another topic,
> so please only answer if you get some form of enjoyment out of solving
> such problems as this one.
>
> I've given this problem a fair bit of my time, and it's been good so
> far as it's forced me to learn new things and challenge my rather
> immature knowledge of Clojure. I want to turn to the answers section
> now though, which is what I'm hoping to get on this forum. So here's
> what I'm trying to do...
>
> I need to parse a 32 MB text file of duplicate file entries. I need to
> be able to get the total number of duplicates as well as the total
> size the duplicates are taking up. As a bonus, it would be good to
> provide an additional categorisation by file extension (.jpg =
> 450.02 MB, .bak = 5.65 GB, etc). Here's how the file is formatted...
>
> 71 byte(null)each:
> ./atgiss1/profiles/rebeccat/DataWorks/DataWorks LIVE/dwlui.ini
> ./atgiss1/profiles/alistairh/DataWorks/DataWorks LIVE/dwlui.ini
>
> 14171 byte(null)each:
> ./atgiss1/profiles/rebeccat/My Documents/Corel User Files/WT9_1US.UWL
> ./atgiss1/profiles/guyc/My Documents/Corel User Files/WT9_1US.UWL
> ./atgiss1/profiles/carls/My Documents/Corel User Files/WT9_1US.UWL
>
> 102 byte(null)each:
> ./atgiss1/profiles/rebeccat/Application Data/AdobeUM/AcRdB7_0_7.sta
> ./atgiss1/profiles/rebeccat/Application Data/AdobeUM/AcRdB7_0_8.sta
>
> Do note, however, that when computing sizes, the first file in each
> group should not be counted, as we only want to know the total space
> being taken up by the duplicates, not the original file.
>
> I've already implemented this in Scala, where I use a global variable
> to keep track of persistent data. I'm finding it hard to morph that
> concept into Clojure, which makes me believe I'm going about it the
> wrong way. Hopefully someone here can demonstrate one of the right
> ways. To get you started, here's a bit of a proof of concept...
>
> (use '[clojure.contrib.duck-streams])
>
> (for [line (line-seq (reader "C:\\atgisfiledupes.txt"))]
>   (some #(if ((first %) 1) %)
>         [{:size (get (re-matches #"([0-9]+) byte\(null\)each:" line) 1)}
>          {:file (get (re-matches #".*(\.[0-9a-zA-Z]+)" line) 1)}
>          (if (= line "") {:blank true} {:other true})]))
>
> (println "Finished!")
>
> If anything, it should give the regex you need to extract data from
> the various lines.
>
> All replies are much appreciated.
>
> Cheers

How about this: http://gist.github.com/293388

Once you call (read-dups "file.txt") and have the seq of dup structs,
you can extract what you need.
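
Since the gist is only linked, here is a minimal sketch of one way to do this without global state. It is not the gist's code. It assumes Clojure 1.2 or later (clojure.java.io and clojure.string rather than the duck-streams contrib library quoted above), and the names parse-groups, read-groups, wasted-bytes, extension and summarize are made up for illustration. The idea is to fold the lines into a vector of {:size ... :files [...]} maps with reduce, then compute the totals from that vector, skipping the first file of each group.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; Header line of a group, e.g. "71 byte(null)each:" (regex from the original post).
(def size-re #"([0-9]+) byte\(null\)each:")

;; Trailing extension of a path, e.g. ".ini" (regex from the original post).
(def ext-re #".*(\.[0-9a-zA-Z]+)")

(defn parse-groups
  "Folds a seq of lines into a vector of {:size n :files [...]} maps.
  The accumulator passed through reduce replaces the global variable.
  Lines are trimmed, which assumes paths have no significant
  leading or trailing whitespace."
  [lines]
  (reduce (fn [groups raw]
            (let [line (str/trim raw)]
              (if-let [[_ size] (re-matches size-re line)]
                ;; A header starts a new, empty group.
                (conj groups {:size (Long/parseLong size) :files []})
                (if (or (str/blank? line) (empty? groups))
                  ;; Skip blank lines and anything before the first header.
                  groups
                  ;; Otherwise the line is a path in the current group.
                  (update-in groups [(dec (count groups)) :files] conj line)))))
          []
          lines))

(defn read-groups
  "Reads and parses the whole file at path (hypothetical helper)."
  [path]
  (with-open [rdr (io/reader path)]
    (parse-groups (line-seq rdr))))

(defn wasted-bytes
  "Bytes used by one group's duplicates, not counting the first file."
  [{:keys [size files]}]
  (* size (max 0 (dec (count files)))))

(defn extension
  "Extension of path including the dot, or \"(none)\" if it has none."
  [path]
  (or (second (re-matches ext-re path)) "(none)"))

(defn summarize
  "Total duplicate count, total duplicate bytes, and bytes per extension."
  [groups]
  {:dup-count (reduce + (map #(max 0 (dec (count (:files %)))) groups))
   :dup-bytes (reduce + (map wasted-bytes groups))
   :by-ext    (reduce (fn [acc {:keys [size files]}]
                        ;; (rest files) drops the original, keeping the duplicates.
                        (reduce (fn [a f] (merge-with + a {(extension f) size}))
                                acc
                                (rest files)))
                      {}
                      groups)})
```

With the three sample groups from the original post, (:dup-count (summarize (parse-groups lines))) comes to 4 and :dup-bytes to 28515 (71 + 2 * 14171 + 102). For the real file it would be (summarize (read-groups "C:\\atgisfiledupes.txt")). Because reduce is eager, the whole file is consumed inside with-open, which avoids the usual closed-stream problem with a lazy line-seq.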