Hi, I've just implemented a simple map-reduce framework which mimics the steps involved in the workflow of a Hadoop job. It basically amounted to implementing a helper function to emit results, plus the shuffle/combine step which happens between a map task and a reduce task. Please bear in mind that I'm still new to Hadoop, so the code below is my interpretation of how a Hadoop job is structured:
(defn emit
  "Helper function to produce intermediate and final results."
  [k v]
  {:k k :v v})

;; Note: this shadows clojure.core/shuffle.
(defn shuffle
  "Shuffle step where all the v's from the map steps get grouped
   by key for the reduce steps."
  [coll]
  (when (seq coll)
    (sort-by key                       ; sort the map entries by key;
                                       ; (sort-by :k ...) returns nil for
                                       ; every entry and so never sorts
             (reduce (fn [acc i]
                       ;; use get rather than calling the key as a
                       ;; function, so non-keyword keys (e.g. strings)
                       ;; work too
                       (assoc acc (:k i) (cons i (get acc (:k i)))))
                     {}
                     ;; only one level of concatenation is needed;
                     ;; flatten would also recurse into nested seqs
                     (apply concat coll)))))

(defn job
  "Equivalent to a Hadoop job."
  [map-fn reduce-fn coll]
  (when (seq coll)
    (map reduce-fn (shuffle (map map-fn coll)))))

And the necessary wordcount example:

(defn tf-mapper [s]
  (when (seq s)
    (map (fn [i] (emit i 1)) s)))

(defn tf-reducer [[k vs]]
  (emit k (reduce (fn [acc {v :v}] (+ acc v)) 0 vs)))

(defn tf
  "Calculates the term frequency of each token in the collection."
  [coll]
  (when (seq coll)
    (job tf-mapper tf-reducer coll)))

user> (tf [[:a :b] [:a :b :c] [:c]])
({:k :a, :v 2} {:k :b, :v 2} {:k :c, :v 2})

The wordcount example does not tokenise the documents/strings; it assumes each document is already a seq, so the collection is just a seq of seqs.

Am I right to think that replacing the calls to map with pmap would make the framework run in parallel within a single box?

Any feedback always welcome.

Cheers,
U

--
You received this message because you are subscribed to the Google Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en
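P.S. For concreteness, here is the sort of pmap variant I have in mind. This is an untested sketch, and pjob is just a name I made up; to keep the snippet standalone I've inlined the shuffle step with clojure.core's group-by rather than reusing the hand-rolled shuffle above (the two produce the same key-to-emitted-maps grouping):

```clojure
(defn pjob
  "Like job, but runs the map and reduce phases with pmap.
   doall forces pmap's semi-lazy result so the parallel work
   happens here rather than when the seq is later consumed.
   group-by :k stands in for the shuffle step: it builds a map
   from each key to the seq of maps emitted for that key."
  [map-fn reduce-fn coll]
  (when (seq coll)
    (doall
     (pmap reduce-fn
           (sort-by key (group-by :k (apply concat (pmap map-fn coll))))))))
```

My understanding is that pmap only pays off when map-fn and reduce-fn do enough work per item to outweigh the thread-coordination overhead, so for a toy wordcount like the one above plain map will likely be faster.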