Hi, I've just implemented a simple map-reduce framework which mimics the steps involved in the workflow of a Hadoop job. It basically amounted to implementing a helper function to emit results, plus the shuffle/combine step which happens between a map task and a reduce task. Please bear in mind that I'm still new to Hadoop, so the code below is my interpretation of how a Hadoop job is structured:
(defn emit
  "Helper function to produce intermediate and final results."
  [k v]
  {:k k :v v})

;; Note: this shadows clojure.core/shuffle.
(defn shuffle
  "Shuffle step where all the v's from the map steps get grouped
   by key for the reduce steps."
  [coll]
  (when (seq coll)
    (sort-by key                       ; sort the map entries by key;
                                       ; (sort-by :k ...) returns nil for
                                       ; every entry and so never sorts
             (reduce (fn [acc i]
                       ;; use get rather than calling the key as a
                       ;; function, so non-keyword keys (e.g. strings)
                       ;; work too
                       (assoc acc (:k i) (cons i (get acc (:k i)))))
                     {}
                     ;; only one level of concatenation is needed;
                     ;; flatten would also recurse into nested seqs
                     (apply concat coll)))))

(defn job
  "Equivalent to a Hadoop job."
  [map-fn reduce-fn coll]
  (when (seq coll)
    (map reduce-fn (shuffle (map map-fn coll)))))

And the necessary wordcount example:

(defn tf-mapper [s]
  (when (seq s)
    (map (fn [i] (emit i 1)) s)))

(defn tf-reducer [[k vs]]
  (emit k (reduce (fn [acc {v :v}] (+ acc v)) 0 vs)))

(defn tf
  "Calculates the term frequency of each token in the collection."
  [coll]
  (when (seq coll)
    (job tf-mapper tf-reducer coll)))

user> (tf [[:a :b] [:a :b :c] [:c]])
({:k :a, :v 2} {:k :b, :v 2} {:k :c, :v 2})

The wordcount example does not tokenise the documents/strings; it assumes each document is already a seq, so the collection is just a seq of seqs.

Am I right to think that replacing the calls to map with pmap would make the framework run in parallel within a single box?

Any feedback always welcome.

Cheers,
U

--
You received this message because you are subscribed to the Google Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en
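P.S. For concreteness, here is the sort of pmap variant I have in mind. This is an untested sketch, and pjob is just a name I made up; to keep the snippet standalone I've inlined the shuffle step with clojure.core's group-by rather than reusing the hand-rolled shuffle above (the two produce the same key-to-emitted-maps grouping):

```clojure
(defn pjob
  "Like job, but runs the map and reduce phases with pmap.
   doall forces pmap's semi-lazy result so the parallel work
   happens here rather than when the seq is later consumed.
   group-by :k stands in for the shuffle step: it builds a map
   from each key to the seq of maps emitted for that key."
  [map-fn reduce-fn coll]
  (when (seq coll)
    (doall
     (pmap reduce-fn
           (sort-by key (group-by :k (apply concat (pmap map-fn coll))))))))
```

My understanding is that pmap only pays off when map-fn and reduce-fn do enough work per item to outweigh the thread-coordination overhead, so for a toy wordcount like the one above plain map will likely be faster.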