Re: Exercise: words frequency ranking

wwmorgan Thu, 25 Dec 2008 08:25:04 -0800

A better implementation would split the different steps of the program
into separate functions. This increases readability and testability of
the source code, and encourages the reuse of code in new programs. I
haven't tested this program, but hopefully you'll understand the
general approach.


Also, you're using a def inside your defn. In all likelihood this
doesn't do what you think it does (defines a locally-scoped variable)
because all def's are global. One consequence of this: if more than
one thread executes topwords (for different files arguments) at the
same time, then a race condition could make one or both of them
generate completely wrong results, and the resulting bug would be
maddeningly difficult to track down. Instead of def, use let instead.

(defn freqs
  "Given a collection, return a mapping of unique elements in the
collection
   to the number of times that the element appears"
  [coll]
    (merge-with + {}
      (for [el coll]
        {el 1})))

(defn word-seq
  "Given a file, return a seq of every word in the file, normalizing
words by
   coverting them to lower case and splitting on whitespace"
  [file]
    (seq (.split (.toLowerCase (slurp in-filepath "\\s+")))))

(defn freq-format
  "Given a map from words to their frequencies, return a pretty
string,
   sorted in descending order by number of appearances"
   [freq]
     (apply str
            (map #(format "%20s : %5d \r\n" (key %) (val %))
                 (sort-by #(- (val %))
                          freq))))

(defn topwords
  "Compute the frequencies of each word in in-filepath. Output the
results to
   out-filepath"
  [in-filepath out-filepath]
    (spit out-filepath
          (str (freq-format (freqs (word-seq in-filepath)))
               "\r\n")))

On Dec 25, 7:16 am, Piotr 'Qertoip' Włodarek <qert...@gmail.com>
wrote:
> Given the input text file, the program should write to disk a ranking
> of words sorted by frequency, like:
>
>                  the : 52483
>                  and : 32558
>                   of : 23477
>                    a : 22486
>                   to : 21993
>
> My first implementation:
>
> (defn topwords [in-filepath, out-filepath]
>   (def words (.split (.toLowerCase (slurp in-filepath)) "\\s+"))
>
>   (spit out-filepath
>         (apply  str
>                 (concat
>                   (map (fn [pair] (format "%20s : %5d \r\n" (key pair)
> (val pair)))
>                        (sort-by #( -(val %) )
>                                 (reduce
>                                   (fn [counted-words word]
>                                       ( assoc counted-words
>                                               word
>                                               (inc (get counted-words
> word 0)) ))
>                                   {}
>                                   words)))
>                   ["\r\n"]))))
>
> Somehow I feel it's far from optimal. Could you please advise and
> improve? What is the best, idiomatic implementation of this simple
> problem?
>
> regards,
> Piotrek
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To post to this group, send email to clojure@googlegroups.com
To unsubscribe from this group, send email to 
clojure+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/clojure?hl=en
-~----------~----~----~----~------~----~------~--~---

Re: Exercise: words frequency ranking

Reply via email to