One thing slowing you down is that your function "parallel-process" is calling fold on a line-seq, which is not a foldable source (fold can only split collections such as vectors and hash maps), so you won't get any parallelism. It devolves to a sequential reduce.
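To see the fallback for yourself, here is a minimal sketch that counts how often fold calls the combining arity. The names text, lines, counting-combine, add-length and combines-used are made up for this illustration, and the in-memory text is only a stand-in for your corpus file.

```clojure
(require '[clojure.core.reducers :as r]
         '[clojure.string :as str])
(import '(java.io BufferedReader StringReader))

;; Hypothetical stand-in for the corpus: 10,000 short in-memory lines.
(def text (str/join "\n" (repeat 10000 "abc")))

(defn lines
  "Returns a fresh lazy line-seq over the in-memory text."
  []
  (line-seq (BufferedReader. (StringReader. text))))

;; Counts how often fold calls the two-argument (combining) arity.
(def combine-calls (atom 0))

(defn counting-combine
  ([] 0)
  ([a b] (swap! combine-calls inc) (+ a b)))

(defn add-length
  "Reducing step: adds the length of one line to the running total."
  [total line]
  (+ total (count line)))

(defn combines-used
  "Folds coll in chunks of 512 and returns the number of combine calls."
  [coll]
  (reset! combine-calls 0)
  (r/fold 512 counting-combine add-length coll)
  @combine-calls)

(combines-used (lines))        ; lazy seq: 0, a plain sequential reduce
(combines-used (vec (lines)))  ; vector: more than 0, the work was split
```

So one option is to realize the file into a vector with (vec (line-seq rdr)) before folding. That makes fold parallel, but it holds every line in memory at once.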
As an alternative, consider partitioning the lines into batches of a few thousand, pipelining the calculation over the batches, and then calling (merge-with +) on all the intermediate results. clojure.core.async/pipeline is a useful function for this.

On Tuesday, September 12, 2017 at 12:43:58 PM UTC-4, darren...@gmail.com wrote:
>
> Hi,
>
> I am a researcher in Natural Language Processing.
> My team wants to know how well Clojure parallelizes and how much time
> it saves compared with a single-threaded Java version.
>
> The problem we want to solve is this:
> there is a big corpus file (currently 500MB).
> Reading sentences line by line, we find all patterns of length 1 through 12
> and their occurrence counts.
>
> It is a very simple problem and the order of processing doesn't matter.
> We just want to build one big hash-map. (The key is a pattern string, the
> value is its occurrence count.)
> Ex) { "father" 10000000 "mother" 10000000 … }
>
> We are comparing performance between Java and Clojure. If the Clojure
> version is faster than Java, we'll move our code base to Clojure; if not,
> we will have to stay with Java.
>
> Anyway, my first prototype is very, very slow. I'm a novice. :(
>
> Please give me some advice.
> Thanks.
>
> (ns parallel-test.core
>   (:require [clojure.java.io :as jio]
>             [clojure.core.reducers :refer [fold]])
>   (:gen-class))
>
> (def corpus-file-url "resources/korean.txt")
> (def OC (atom nil))
> (def MPL 12)
> (def each-size 10000)
>
> ;; Adds ptn-oc to the count stored for ptn, starting from ptn-oc if absent.
> (defn add-pattern-to-hashmap
>   [h-map ^String ptn ^Integer ptn-oc]
>   (let [h-ptn-oc (get h-map ptn)
>         n-ptn-oc (if (nil? h-ptn-oc)
>                    ptn-oc
>                    (+ h-ptn-oc ptn-oc))]
>     (assoc h-map ptn n-ptn-oc)))
>
> ;; Combining function: merges count maps by summing counts per pattern.
> (defn merge-hash-map
>   ([] (hash-map))
>   ([& hs]
>    (reduce (fn [l-map r-map]
>              ;; fold every [pattern count] entry of r-map into l-map
>              (reduce (fn [acc [ptn ptn-oc]]
>                        (add-pattern-to-hashmap acc ptn ptn-oc))
>                      l-map
>                      r-map))
>            hs)))
>
> ;; Reducing function: counts every substring of length 1..MPL in one line.
> (defn cal-line-oc
>   ([] (hash-map))
>   ([h-map ^String line]
>    (let [line-length (count line)]
>      (loop [i 0
>             i-map h-map]
>        (if (>= i line-length)
>          i-map
>          (recur (inc i)
>                 (loop [j 1
>                        j-map i-map]
>                   (let [end-index (+ i j)]
>                     (if (or (> j MPL) (> end-index line-length))
>                       j-map
>                       (recur (inc j)
>                              (add-pattern-to-hashmap j-map (subs line i end-index) 1)))))))))))
>
> (defn parallel-process
>   [combine-fn reduce-fn input-file]
>   (with-open [rdr (jio/reader input-file)]
>     (fold each-size
>           combine-fn
>           reduce-fn
>           (line-seq rdr))))
>
> (defn -main [& args]
>   (println "start")
>   (reset! OC (parallel-process merge-hash-map cal-line-oc corpus-file-url))
>   (println "end"))
>

--
You received this message because you are subscribed to the Google Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en
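
For reference, the batching approach suggested in the reply could look roughly like the sketch below. To stay free of extra dependencies it uses pmap from clojure.core in place of clojure.core.async/pipeline, so treat it as an illustration of the batch-then-merge idea rather than the pipeline version. The names max-pattern-length, patterns, batch-counts and count-patterns are hypothetical, and the batch size is something you would need to tune.

```clojure
(require '[clojure.java.io :as jio])

;; Assumption: plays the same role as MPL in the quoted code.
(def max-pattern-length 12)

(defn patterns
  "Lazy seq of every substring of line with length 1..max-pattern-length."
  [^String line]
  (let [n (count line)]
    (for [i (range n)
          j (range 1 (inc max-pattern-length))
          :let [end (+ i j)]
          :while (<= end n)]   ; stop growing j once it runs past the line
      (subs line i end))))

(defn batch-counts
  "Pattern -> count map for one batch of lines."
  [lines]
  (frequencies (mapcat patterns lines)))

(defn count-patterns
  "Reads input-file in batches of batch-size lines, counts each batch on
  pmap's thread pool, and sums the per-batch maps with (merge-with +)."
  [input-file batch-size]
  (with-open [rdr (jio/reader input-file)]
    (->> (line-seq rdr)
         (partition-all batch-size)
         (pmap batch-counts)
         ;; reduce is eager, so all work finishes before the reader closes
         (reduce (partial merge-with +) {}))))
```

Something like (count-patterns "resources/korean.txt" 5000) would then return the combined map. Because pmap runs on the future thread pool, a standalone program should call (shutdown-agents) before exiting so the JVM does not linger.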