Hi, I would like to ask for some advice regarding a rather unusual interaction between lazy-seq and threads. I have code that opens some big compressed text files and processes them line by line. Reduced to a minimal example, it looks like this:

    (with-open [i (-> "mybigfile.gz"
                      clojure.java.io/input-stream
                      java.util.zip.GZIPInputStream.
                      clojure.java.io/reader)]
      (count (line-seq i)))

where, for the sake of illustration, the processing is replaced by simple counting.

In the single-thread case everything works very well, with performance numbers close to Java (or even equal with "-XX:MaxInlineLevel=16"). However, once I run it in several threads, either native Java Threads or futures, instead of the expected speedup from parallel processing, things are even slower than running the files sequentially. Interestingly enough, the JVM pegs at 500-600% CPU (I have 8 cores).

I was not sure what the reason was, and in order to rule out some basic assumptions I created a Java equivalent. It runs at 200% CPU and scales above 4 cores, which is exactly what I want and matches gzip behavior. (I can run almost 6 instances of "gunzip -c mybigfile.gz | wc -l", each taking 100% CPU.)

The next logical step was to look into the Clojure sources. What I found is that lazy-seq is synchronized: https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/LazySeq.java . From what I understand, the JIT optimizes the single-thread case and removes the "synchronized" guards; however, as soon as other threads come into play I am forced to pay the price of synchronization, which causes the performance degradation*.

Interestingly enough, the JIT does optimize a version without GZIPInputStream, and there I get the same results as with Java when using multiple threads. I have to run it with "-XX:MaxInlineLevel=16" though. With the default "-XX:MaxInlineLevel=9", the JIT does not kick in and the performance is not there. There is probably another JVM switch that would give the JIT better hints, but I am not convinced that this is the right direction.

I really like the semantics of line-seq, just without the "synchronized" part, since in my context there is no way that two threads touch the same seq. I would like to ask for some advice on what my options are here.
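
One option that stays in Clojure can be sketched as follows. It assumes the LazySeq synchronization really is the bottleneck, which is unverified, and it gives up the seq abstraction: the reader is consumed with an eager loop over .readLine, so no LazySeq is created at all. The name count-gz-lines and the file names in the usage comment are hypothetical, and this is only a minimal sketch, not a measured fix.

```clojure
(require '[clojure.java.io :as io])
(import '(java.io BufferedReader)
        '(java.util.zip GZIPInputStream))

(defn count-gz-lines
  "Counts the lines of a gzip-compressed text file with an eager loop.
  Hypothetical helper: it calls .readLine directly instead of going
  through line-seq, so no LazySeq (and none of its synchronized
  methods) is involved."
  [path]
  (with-open [^BufferedReader rdr (-> path
                                      io/input-stream
                                      GZIPInputStream.
                                      io/reader)]
    (loop [n 0]
      ;; .readLine returns nil at end of stream; an empty line is the
      ;; empty string, which is truthy in Clojure and is still counted.
      (if (.readLine rdr)
        (recur (inc n))
        n))))

;; Usage sketch: one future per file (the file names are placeholders).
;; (let [files   ["a.gz" "b.gz" "c.gz"]
;;       results (mapv #(future (count-gz-lines %)) files)]
;;   (mapv deref results))
```

For real per-line processing, the (inc n) step would be replaced by whatever accumulation is needed, at the cost of losing the composability that line-seq offers.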

The last resort is to write handling code in Java, but I really want to avoid this.

Best,
Andy

*My analysis might be wrong of course.

--
You received this message because you are subscribed to the Google Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en