Hi, I am a Natural Language Processing researcher. My team wants to know how well Clojure parallelizes and how much time it saves compared with a single-threaded Java version.
The problem we want to solve is this: we have a large corpus file (currently 500MB). Reading it sentence by sentence, one per line, we find every pattern of length 1 through 12 and count its occurrences. It is a very simple problem, and the order of processing does not matter. We just want to build one big hash-map (key: a pattern string, value: its occurrence count).

Ex) { "father" 10000000 "mother" 10000000 … }

We will compare the performance of Java and Clojure. If the Clojure version is faster than the Java one, we will move our code base to Clojure; if not, we will have to stay with Java.

Anyway, my first prototype is very, very slow. I'm a novice. :( Please give me some advice. Thanks.

(ns parallel-test.core
  (:require [clojure.java.io :as jio]
            [clojure.core.reducers :refer [fold]])
  (:gen-class))

(def corpus-file-url "resources/korean.txt")
(def OC (atom nil))
(def MPL 12)           ; upper bound on pattern length (see cal-line-oc)
(def each-size 10000)  ; partition size passed to fold

;; Add ptn-oc to the count stored for ptn, starting from ptn-oc if ptn is new.
(defn add-pattern-to-hashmap
  [h-map ^String ptn ^Integer ptn-oc]
  (let [h-ptn-oc (get h-map ptn)
        n-ptn-oc (if (nil? h-ptn-oc) ptn-oc (+ h-ptn-oc ptn-oc))]
    (assoc h-map ptn n-ptn-oc)))

;; Combining function for fold: sums the counts of all given maps.
(defn merge-hash-map
  ([] (hash-map))
  ([& hs]
   (reduce (fn [l-map r-map]
             ;; The inner reducing fn takes the accumulator and the entry,
             ;; and starts from l-map.
             (reduce (fn [acc [ptn ptn-oc]]
                       (add-pattern-to-hashmap acc ptn ptn-oc))
                     l-map
                     r-map))
           hs)))

;; Reducing function for fold: counts every substring of line
;; whose length is between 1 and MPL.
(defn cal-line-oc
  ([] (hash-map))
  ([h-map ^String line]
   (let [line-length (count line)]
     (loop [i 0
            i-map h-map]
       (if (>= i line-length)
         i-map
         (recur (inc i)
                (loop [j 1
                       j-map i-map]
                  (let [end-index (+ i j)]
                    (if (or (> j MPL) (> end-index line-length))
                      j-map
                      (recur (inc j)
                             (add-pattern-to-hashmap j-map (subs line i end-index) 1)))))))))))

(defn parallel-process
  [combine-fn reduce-fn input-file]
  (with-open [rdr (jio/reader input-file)]
    (fold each-size combine-fn reduce-fn (line-seq rdr))))

(defn -main
  [& args]
  (println "start")
  (reset! OC (parallel-process merge-hash-map cal-line-oc corpus-file-url))
  (println "end"))
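For reference, here is a minimal sketch of the counting step described above, written so that fold can actually split the work. It relies on one documented property of clojure.core.reducers: fold only runs in parallel on foldable collections such as vectors and maps, and falls back to a plain sequential reduce for anything else, including the lazy sequence returned by line-seq. The names pattern-count.sketch, max-pattern-length, line-patterns, count-line, merge-counts and count-patterns, and the partition size 512, are hypothetical choices for illustration, not part of the prototype.

```clojure
(ns pattern-count.sketch
  (:require [clojure.core.reducers :as r]))

;; Hypothetical name; plays the same role as MPL in the prototype.
(def max-pattern-length 12)

(defn line-patterns
  "All substrings of line whose length is between 1 and max-len."
  [^String line max-len]
  (let [n (count line)]
    (for [i (range n)
          ;; never run past the end of the line
          j (range 1 (inc (min max-len (- n i))))]
      (subs line i (+ i j)))))

(defn count-line
  "Reducing fn: adds the pattern counts of one line to counts."
  [counts line]
  (reduce (fn [m p] (assoc m p (inc (get m p 0))))
          counts
          (line-patterns line max-pattern-length)))

(defn merge-counts
  "Combining fn: zero args gives the empty map, two args sum the counts."
  ([] {})
  ([a b] (merge-with + a b)))

(defn count-patterns
  "Counts patterns over lines. vec makes the input foldable;
  512 is an arbitrary partition size chosen for this sketch."
  [lines]
  (r/fold 512 merge-counts count-line (vec lines)))

;; (count-patterns ["abab" "ba"])
;; => {"a" 3, "ab" 2, "aba" 1, "abab" 1, "b" 3, "ba" 2, "bab" 1}
```

Calling vec realizes every line in memory before the fold starts, which is a real cost for a 500MB corpus, so this is only a starting point for measurement and not a tuned solution.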
--
You received this message because you are subscribed to the Google Groups "Clojure" group.