Hi,

I would like to ask for some advice regarding a somewhat unusual
interaction between lazy-seq and threads. I have code that opens some big
compressed text files and processes them line by line. Reduced to a minimal
example, the code looks like this:

  (with-open [i (-> "mybigfile.gz"
                    clojure.java.io/input-stream
                    java.util.zip.GZIPInputStream.
                    clojure.java.io/reader)]
    (count (line-seq i)))

where for the sake of visualization, the processing is replaced by a simple
counting.

In the single-threaded case, everything works very well, with performance
numbers close to Java (or even equal with "-XX:MaxInlineLevel=16").
However, once I run it in several threads, either native Java Threads or
futures, instead of the expected benefit of parallel processing, things are
even slower than they would be if run sequentially. Interestingly enough,
the JVM pegs at 500-600% CPU (I have 8 cores). I was not sure what the
reason was, and in order to rule out some basic assumptions, I created a
Java equivalent. It runs at 200% CPU and scales above 4 cores, which is
exactly what I want, and matches gzip behavior. (I can run almost 6
instances of "gunzip -c mybigfile.gz | wc -l", each taking 100% CPU.)
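
For illustration, the threaded run described above could be sketched roughly
as follows. The helper names count-lines and count-lines-in-parallel and the
file names in the usage line are hypothetical, made up for this sketch; only
the with-open/line-seq form comes from the example above.

```clojure
(require '[clojure.java.io :as io])
(import 'java.util.zip.GZIPInputStream)

;; Hypothetical helper: count the lines of one gzip-compressed file
;; through line-seq, exactly as in the single-threaded example.
(defn count-lines [path]
  (with-open [r (-> path io/input-stream GZIPInputStream. io/reader)]
    (count (line-seq r))))

;; Hypothetical helper: start one future per file, then wait for all
;; of them. Each future opens its own reader, so no seq is shared
;; between threads.
(defn count-lines-in-parallel [paths]
  (let [futures (mapv #(future (count-lines %)) paths)]
    (mapv deref futures)))
```

Usage would be something like (count-lines-in-parallel ["a.gz" "b.gz"]),
which returns a vector of line counts in the same order as the paths.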

The next logical step was to look into the Clojure sources. What I found
is that lazy-seq is synchronized:
https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/LazySeq.java
. From what I understand, the JIT optimizes the single-threaded case and
removes the "synchronized" guards; however, as soon as other threads come
into play I am forced to pay the price for synchronization, which causes
the performance degradation*.

Interestingly enough, the JIT optimizes a version without GZIPInputStream,
and I am getting the same results as with Java with multiple threads. I
have to run it with "-XX:MaxInlineLevel=16" though. With the default
"-XX:MaxInlineLevel=9", the JIT does not kick in and the performance is not
there. There is probably another JVM switch which would help hint the JIT
better; however, I am not convinced that this is the right direction.

I really like the semantics of line-seq, but I would prefer it without the
"synchronized" part, as in my context there is no way that two threads
touch the same seq.

I would like to ask for some advice on what my options are here. The last
resort is to write the handling code in Java, but I really want to avoid
this.
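
One option that stays in Clojure might be to skip the lazy seq and call
readLine in a loop, so that no LazySeq objects are created at all. Below is
a minimal sketch under that assumption; reduce-lines is a hypothetical name,
not an existing Clojure function, and whether this actually recovers the
multi-threaded performance is untested.

```clojure
(require '[clojure.java.io :as io])
(import '(java.io BufferedReader)
        '(java.util.zip GZIPInputStream))

;; Hypothetical helper: eagerly fold f over the lines of a gzip-compressed
;; file. It calls BufferedReader.readLine directly instead of going through
;; line-seq, so no lazy seq is built.
(defn reduce-lines [f init path]
  (with-open [^BufferedReader r (-> path
                                    io/input-stream
                                    GZIPInputStream.
                                    io/reader)]
    (loop [acc init]
      ;; readLine returns nil at end of stream; an empty line is "",
      ;; which is truthy in Clojure, so blank lines are still processed.
      (if-let [line (.readLine r)]
        (recur (f acc line))
        acc))))
```

The counting example would then become
(reduce-lines (fn [n _] (inc n)) 0 "mybigfile.gz"). This gives up the seq
abstraction but keeps the line-by-line processing style.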

Best,
Andy

*My analysis might be wrong of course.

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en