Michal, thanks for your example. It didn't seem to work too well for
me, though that may have been partly my fault. It did, however, expose
me to some new concepts and their uses.

Ataggart, what an excellent solution. It's pretty much exactly what I
was after, and the code is very clean and easy to understand. Thanks a
lot for that. The use of (->>) was very clever. Your example certainly
makes Clojure look good.
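
For anyone reading the thread without the gist open, here is a minimal
sketch of one way to approach the problem quoted below. It is not
ataggart's code. The function names (extension, parse-group,
read-groups, summarize) are made up for illustration, and it assumes
clojure.string and clojure.java.io instead of the
clojure.contrib.duck-streams library used in the original proof of
concept. The two regexes are taken from that proof of concept.

```clojure
(require '[clojure.string :as str]
         '[clojure.java.io :as io])

;; Regexes taken from the proof of concept in the original post.
(def size-re #"([0-9]+) byte\(null\)each:")
(def ext-re  #".*(\.[0-9a-zA-Z]+)")

(defn extension
  "Returns the file extension including the dot, or \"(none)\"."
  [path]
  (or (second (re-matches ext-re path)) "(none)"))

(defn parse-group
  "Turns one block of lines (a size header followed by file paths)
  into {:size n :files [...]}. Returns nil if the header doesn't match."
  [[header & files]]
  (when-let [[_ size] (re-matches size-re header)]
    {:size (Long/parseLong size) :files (vec files)}))

(defn read-groups
  "Splits a seq of lines into blank-line-separated groups and parses each."
  [lines]
  (->> lines
       (map str/trim)
       (partition-by str/blank?)          ; runs of blank / non-blank lines
       (remove #(str/blank? (first %)))   ; drop the blank runs
       (keep parse-group)))               ; drop groups with a bad header

(defn summarize
  "Counts and sizes only the redundant copies: the first file in each
  group is treated as the original and skipped."
  [groups]
  (let [dups (for [{:keys [size files]} groups
                   file (rest files)]
               {:size size :ext (extension file)})]
    {:count  (count dups)
     :bytes  (reduce + (map :size dups))
     :by-ext (reduce (fn [m {:keys [size ext]}]
                       (update m ext (fnil + 0) size))
                     {}
                     dups)}))

(comment
  ;; Usage against the file from the original post. summarize is eager,
  ;; so the reader can be closed safely once it returns.
  (with-open [r (io/reader "C:\\atgisfiledupes.txt")]
    (summarize (read-groups (line-seq r)))))
```

Run against the three sample groups quoted below, this should report 4
duplicates totalling 28515 bytes (71 + 2 x 14171 + 102). Working on a
seq of lines rather than a file name keeps the parsing pure, so no
global state is needed to track the running totals.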

On Feb 3, 4:28 pm, ataggart <alex.tagg...@gmail.com> wrote:
> On Feb 2, 7:53 pm, Wardrop <t...@tomwardrop.com> wrote:
>
> > I feel like I'm over-staying my welcome by posting yet another topic,
> > so please only answer if you get some form of enjoyment out of solving
> > such problems as this one.
>
> > I've given this problem a fair bit of my time, and it's been good so
> > far as it's forced me to learn new things and challenge my rather
> > immature knowledge of clojure. I want to turn to the answers section
> > now though, which is what I'm hoping to get on this forum. So here's
> > what I'm trying to do...
>
> > I need to parse a 32 MB text file of duplicate file entries. I need to
> > be able to get the total number of duplicates as well as the total
> > size the duplicates are taking up. As a bonus, it would be good to
> > provide an additional categorisation by file extension (.jpg =
> > 450.02mb, .bak = 5.65GB, etc). Here's how the file is formatted...
>
> > 71 byte(null)each:
> > ./atgiss1/profiles/rebeccat/DataWorks/DataWorks LIVE/dwlui.ini
> > ./atgiss1/profiles/alistairh/DataWorks/DataWorks LIVE/dwlui.ini
>
> > 14171 byte(null)each:
> > ./atgiss1/profiles/rebeccat/My Documents/Corel User Files/WT9_1US.UWL
> > ./atgiss1/profiles/guyc/My Documents/Corel User Files/WT9_1US.UWL
> > ./atgiss1/profiles/carls/My Documents/Corel User Files/WT9_1US.UWL
>
> > 102 byte(null)each:
> > ./atgiss1/profiles/rebeccat/Application Data/AdobeUM/AcRdB7_0_7.sta
> > ./atgiss1/profiles/rebeccat/Application Data/AdobeUM/AcRdB7_0_8.sta
>
> > Note, however, that when computing sizes the first file in each group
> > should not be counted, as we only want to know the total space being
> > taken up by the duplicates, not the original file.
>
> > I've already implemented this in Scala, where I use a global variable to
> > keep track of persistent data. I'm finding it hard to morph that
> > concept into Clojure, which makes me believe I'm going about it the
> > wrong way. Hopefully someone here can demonstrate one of the right
> > ways. To get you started, here's a bit of a proof of concept...
>
> > (use '[clojure.contrib.duck-streams])
>
> > (for [line (line-seq (reader "C:\\atgisfiledupes.txt"))]
> >   (some #(if ((first %) 1) %)
> >     [{:size (get (re-matches #"([0-9]+) byte\(null\)each:" line) 1)}
> >      {:file (get (re-matches #".*(\.[0-9a-zA-Z]+)" line) 1)}
> >      (if (= line "") {:blank true} {:other true})]))
>
> > (println "Finished!")
>
> > If anything, it should give the regex you need to extract data from
> > the various lines.
>
> > All replies are much appreciated.
>
> > Cheers
>
> How about this:  http://gist.github.com/293388
>
> Once you call (read-dups "file.txt") and have the seq of dup structs,
> you can extract what you need.

