Hi,

I've just implemented a simple map-reduce framework which mimics the
steps involved in the workflow of a Hadoop job. It basically amounted
to implementing a helper function to emit results and the
shuffle/combine step which happens between the map and reduce tasks.
Please bear in mind that I'm still new to Hadoop, so the code below
reflects my interpretation of how a Hadoop job is structured:

(defn emit
  "Helper function to produce intermediate and final results."
  [k v]
  {:k k :v v})

(defn shuffle
  "Shuffle step where all v's from the maps get grouped by key for
  the reduce steps. NB: shadows clojure.core/shuffle."
  [coll]
  (when (seq coll)
    (sort-by first
             (reduce
              (fn [acc i]
                ;; group each emitted entry under its key; use get so
                ;; any key type works (string keys are not callable
                ;; as functions the way keywords are)
                (assoc acc (:k i) (cons i (get acc (:k i)))))
              {} (apply concat coll)))))
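
A quick sanity check of the grouping step in isolation (restated in a
self-contained form under a different name so it runs on its own; note
the key is looked up with get, since string keys can't be called as
functions the way keywords can):

```clojure
;; emit and the grouping step restated so this snippet runs on its own
(defn emit [k v] {:k k :v v})

(defn shuffle-demo
  [coll]
  (sort-by first
           (reduce (fn [acc i]
                     (assoc acc (:k i) (cons i (get acc (:k i)))))
                   {} (apply concat coll))))

(shuffle-demo [[(emit "a" 1) (emit "b" 1)]
               [(emit "a" 1)]])
;; => (["a" ({:k "a", :v 1} {:k "a", :v 1})] ["b" ({:k "b", :v 1})])
```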

(defn job
  "Equivalent to a Hadoop job: map, shuffle, then reduce."
  [map-fn reduce-fn coll]
  (when (seq coll)
    (map reduce-fn
         (shuffle (map map-fn coll)))))

And the obligatory wordcount example:

(defn tf-mapper
  "Emits a count of 1 for every token in the document."
  [s]
  (when (seq s)
    (map (fn [i] (emit i 1)) s)))

(defn tf-reducer
  "Sums the emitted counts for one key."
  [[k vs]]
  (emit k (reduce (fn [acc {v :v}] (+ acc v)) 0 vs)))
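
Checked at the REPL (restating the definitions so the snippet stands
alone):

```clojure
(defn emit [k v] {:k k :v v})

(defn tf-mapper
  [s]
  (when (seq s)
    (map (fn [i] (emit i 1)) s)))

(defn tf-reducer
  [[k vs]]
  (emit k (reduce (fn [acc {v :v}] (+ acc v)) 0 vs)))

(tf-mapper ["to" "be" "to"])
;; => ({:k "to", :v 1} {:k "be", :v 1} {:k "to", :v 1})

(tf-reducer ["to" [(emit "to" 1) (emit "to" 1)]])
;; => {:k "to", :v 2}
```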

(defn tfidf
  "Calculates the term frequency of each token in the collection
  (only the tf half of tf-idf so far)."
  [coll]
  (when (seq coll)
    (job tf-mapper tf-reducer coll)))

The wordcount example does not tokenise the documents/strings; it
assumes that each document is already a seq of tokens, so the
collection is just a seq of seqs.
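
If you do want to feed plain strings in, a one-line tokeniser would do
(just a sketch, splitting on whitespace with clojure.string):

```clojure
(require '[clojure.string :as str])

(defn tokenize
  "Naive whitespace tokeniser; no lower-casing or punctuation handling."
  [s]
  (str/split s #"\s+"))

(tokenize "to be or not to be")
;; => ["to" "be" "or" "not" "to" "be"]
```

so (tfidf (map tokenize docs)) would run the job over raw strings.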

Am I right in thinking that replacing the calls to map with pmap would
make the framework run in parallel on a single box?

Any feedback always welcome.

Cheers,

U

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com