One thing slowing you down is that your function "parallel-process" calls 
fold on a line-seq. A lazy sequence is not a foldable source, so you won't 
get any parallelism: fold falls back to a sequential reduce.
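
To illustrate the difference, here is a minimal sketch that runs the same 
fold over a vector (foldable) and over a lazy seq (not foldable). The names 
merge-counts and add-line and the sample data are made up for the example, 
and it counts single characters rather than your patterns. Note that putting 
the lines in a vector only helps if the whole corpus fits in memory.

```clojure
(require '[clojure.core.reducers :as r])

(defn merge-counts
  "Combine function for fold: identity map with no args, merged map with two."
  ([] {})
  ([a b] (merge-with + a b)))

(defn add-line
  "Reduce function: adds the character counts of one line to `counts`."
  [counts line]
  (merge-with + counts (frequencies line)))

;; Made-up stand-in data; a real run would use the lines of the corpus.
(def lines (vec (repeat 100000 "abc")))

;; A vector is foldable, so fold can split it into chunks of roughly
;; 1000 elements and reduce the chunks in parallel.
(def parallel-result
  (r/fold 1000 merge-counts add-line lines))

;; The same call on a lazy seq is accepted, but it runs as a plain reduce
;; and merge-counts is only ever called with no arguments.
(def sequential-result
  (r/fold 1000 merge-counts add-line (map identity lines)))
```

Both calls return the same map; only the first one can use more than one 
core.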

As an alternative, consider partitioning the lines into batches of a few 
thousand, pipelining the calculation over each batch, and then calling 
(merge-with +) on all the intermediate results. clojure.core.async/pipeline 
is a useful function for this.
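
A rough sketch of that approach, assuming org.clojure/core.async is on the 
classpath. The function names (line-counts, batch-counts, count-patterns), 
the batch size of 5000, the channel buffer of 8 and the choice of one worker 
per available processor are my assumptions, not tuned values. There is no 
error handling: if reading the file throws, the input channel is never 
closed and the call will hang.

```clojure
(ns parallel-test.pipeline-sketch
  (:require [clojure.core.async :as a]
            [clojure.java.io :as jio]))

(def max-pattern-length 12) ; same role as MPL in the original code
(def batch-size 5000)       ; assumed value for "a few thousand" lines

(defn line-counts
  "Adds one occurrence to `counts` for every substring of `line` whose
  length is between 1 and max-pattern-length."
  [counts ^String line]
  (let [n (count line)]
    (reduce (fn [m [start end]]
              (update m (subs line start end) (fnil inc 0)))
            counts
            (for [start (range n)
                  end   (range (inc start)
                               (inc (min n (+ start max-pattern-length))))]
              [start end]))))

(defn batch-counts
  "Counts the patterns of one batch of lines into a fresh map."
  [lines]
  (reduce line-counts {} lines))

(defn count-patterns
  "Reads `input-file` in batches, counts each batch on a pipeline of worker
  threads, and merges the per-batch maps into one result map."
  [input-file]
  (with-open [rdr (jio/reader input-file)]
    (let [in      (a/chan 8)
          out     (a/chan 8)
          workers (.availableProcessors (Runtime/getRuntime))]
      ;; Feed batches from a separate thread so that reading the file
      ;; overlaps with counting; close `in` when the file is exhausted.
      (a/thread
        (doseq [batch (partition-all batch-size (line-seq rdr))]
          (a/>!! in batch))
        (a/close! in))
      ;; pipeline applies the transducer on `workers` threads and closes
      ;; `out` once `in` is closed and drained.
      (a/pipeline workers out (map batch-counts) in)
      ;; Block until every batch result has been merged, so the reader
      ;; stays open for as long as the feeding thread needs it.
      (a/<!! (a/reduce (partial merge-with +) {} out)))))
```

Usage would be something like (count-patterns "resources/korean.txt"). The 
final merge is still sequential, so if it dominates the run time it may be 
worth trying larger batches.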


On Tuesday, September 12, 2017 at 12:43:58 PM UTC-4, darren...@gmail.com 
wrote:
>
> Hi, 
>
> I am a researcher in Natural Language Processing.
> My team wants to know how well Clojure parallelizes and how much time it 
> saves compared with a single-threaded Java version.
>
> The problem we want to solve is this: 
> there is a big corpus file (currently 500MB).
> Reading the sentences line by line, we find all patterns of length 1 
> through 12 and their occurrence counts.
>
> It is a very simple problem, and the order of processing doesn’t matter.
> We just want to build one big hash-map. (The key is a pattern string, the 
> value is an occurrence count.)
> Ex) { "father" 10000000 "mother" 10000000 … }
>
> We are comparing performance between Java and Clojure. If the Clojure 
> version is better than Java, 
> then we’ll change our code base to Clojure; if not, we have no choice but 
> to stay with Java. 
>
> Anyway, my first prototype is very, very slow. I’m a novice.  :( 
>
> Please give me some advice.
> Thanks.  
>
> (ns parallel-test.core
>  (:require [clojure.java.io :as jio]
>            [clojure.core.reducers :refer [fold]])
>  (:gen-class))
>
> (def corpus-file-url "resources/korean.txt")
> (def OC (atom nil))
> (def MPL 12)
> (def each-size 10000)
>
> (defn add-pattern-to-hashmap
>  [h-map ^String ptn ^Integer ptn-oc]
>  (let [h-ptn-oc (get h-map ptn)
>        n-ptn-oc (if (nil? h-ptn-oc)
>                  ptn-oc
>                  (+ h-ptn-oc ptn-oc))]
>   (assoc h-map ptn n-ptn-oc)))
>
> (defn merge-hash-map
>  ([] (hash-map))
>  ([& hs]
>   (reduce (fn [l-map r-map]
>            (reduce (fn [acc [ptn ptn-oc]]
>                     (add-pattern-to-hashmap acc ptn ptn-oc))
>                    l-map
>                    r-map))
>           hs)))
>
> (defn cal-line-oc
>  ([] (hash-map))
>  ([h-map ^String line]
>   (let [line-length (count line)]
>    (loop [i 0
>           i-map h-map]
>     (if (>= i line-length)
>      i-map
>      (recur (inc i)
>             (loop [j 1
>                    j-map i-map]
>              (let [end-index (+ i j)]
>               (if (or (> j MPL) (> end-index line-length))
>                j-map
>                (recur (inc j)
>                       (add-pattern-to-hashmap j-map (subs line i end-index) 1)))))))))))
>
> (defn parallel-process
>  [combine-fn reduce-fn input-file]
>  (with-open [rdr (jio/reader input-file)]
>   (fold each-size
>         combine-fn
>         reduce-fn
>         (line-seq rdr))))
>
> (defn -main [& args]
>  (println "start")
>  (reset! OC (parallel-process merge-hash-map cal-line-oc corpus-file-url))
>  (println "end"))
>
>
