On Feb 2, 7:53 pm, Wardrop <t...@tomwardrop.com> wrote:
> I feel like I'm over-staying my welcome by posting yet another topic,
> so please only answer if you get some form of enjoyment out of solving
> such problems as this one.
>
> I've given this problem a fair bit of my time, and it's been good so
> far as it's forced me to learn new things and challenge my rather
> immature knowledge of clojure. I want to turn to the answers section
> now though, which is what I'm hoping to get on this forum. So here's
> what I'm trying to do...
>
> I need to parse a 32 MB text file of duplicate file entries. I need to
> be able to get the total number of duplicates as well as the total
> size the duplicates are taking up. As a bonus, it would be good to
> provide an additional categorisation by file extension (.jpg =
> 450.02 MB, .bak = 5.65 GB, etc). Here's how the file is formatted...
>
> 71 byte(null)each:
> ./atgiss1/profiles/rebeccat/DataWorks/DataWorks LIVE/dwlui.ini
> ./atgiss1/profiles/alistairh/DataWorks/DataWorks LIVE/dwlui.ini
>
> 14171 byte(null)each:
> ./atgiss1/profiles/rebeccat/My Documents/Corel User Files/WT9_1US.UWL
> ./atgiss1/profiles/guyc/My Documents/Corel User Files/WT9_1US.UWL
> ./atgiss1/profiles/carls/My Documents/Corel User Files/WT9_1US.UWL
>
> 102 byte(null)each:
> ./atgiss1/profiles/rebeccat/Application Data/AdobeUM/AcRdB7_0_7.sta
> ./atgiss1/profiles/rebeccat/Application Data/AdobeUM/AcRdB7_0_8.sta
>
> Note, however, that the first file in each group should not be counted
> when computing sizes, as we only want to know the total space being
> taken up by the duplicates, not the original file.
>
> I've already implemented this in Scala, where I use global variables to
> keep track of persistent data. I'm finding it hard to morph that
> concept into Clojure, which makes me believe I'm going about it the
> wrong way. Hopefully someone here can demonstrate one of the right
> ways. To get you started, here's a bit of a proof of concept...
>
> (use '[clojure.contrib.duck-streams])
>
> (for [line (line-seq (reader "C:\\atgisfiledupes.txt"))]
>   (some #(if ((first %) 1) %)
>     [{:size (get (re-matches #"([0-9]+) byte\(null\)each:" line) 1)}
>      {:file (get (re-matches #".*(\.[0-9a-zA-Z]+)" line) 1)}
>      (if (= line "") {:blank true} {:other true})]))
>
> (println "Finished!")
>
> If anything, it should give the regex you need to extract data from
> the various lines.
>
> All replies are much appreciated.
>
> Cheers

How about this:  http://gist.github.com/293388

Once you call (read-dups "file.txt") and have the seq of dup structs,
you can extract what you need.
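
For anyone reading without the gist open, here is a minimal sketch of the
general shape such a solution could take. It is not the code from the gist:
the names parse-dups, extension, wasted-bytes and summarize are hypothetical,
and it assumes clojure.java.io and clojure.string (Clojure 1.5 or later)
rather than clojure.contrib.duck-streams. Only the header regex comes from
the original post. No global state is needed; the totals fall out of a
reduce over the parsed groups.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; Header lines look like "71 byte(null)each:" (regex from the original post).
(def size-re #"([0-9]+) byte\(null\)each:")

(defn parse-dups
  "Turns a seq of lines into a seq of {:size n :files [...]} maps,
  one per blank-line-separated group. Groups without a size header are skipped."
  [lines]
  (->> lines
       (map str/trim)
       (partition-by str/blank?)
       (remove #(str/blank? (first %)))
       (keep (fn [[header & files]]
               (when-let [[_ size] (re-matches size-re header)]
                 {:size (Long/parseLong size) :files (vec files)})))))

(defn read-dups
  "Reads the whole file eagerly so the reader can be closed safely."
  [path]
  (with-open [r (io/reader path)]
    (doall (parse-dups (line-seq r)))))

(defn extension
  "Lower-cased extension such as \".ini\", or nil when the name has none."
  [path]
  (some-> (re-find #"\.[0-9a-zA-Z]+$" path) str/lower-case))

(defn wasted-bytes
  "Bytes used by the redundant copies; the first file counts as the original."
  [{:keys [size files]}]
  (* size (max 0 (dec (count files)))))

(defn summarize
  "Total duplicate count, total bytes, and bytes per extension."
  [dups]
  {:duplicates (reduce + (map #(max 0 (dec (count (:files %)))) dups))
   :bytes      (reduce + (map wasted-bytes dups))
   :by-ext     (reduce (fn [m [ext size]] (update-in m [ext] (fnil + 0) size))
                       {}
                       (for [{:keys [size files]} dups
                             f (rest files)]
                         [(extension f) size]))})
```

With that in place, (summarize (read-dups "file.txt")) returns a single map.
For the three groups quoted above it should report 4 duplicates and 28515
bytes (71 + 2 * 14171 + 102), split across .ini, .uwl and .sta.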
