Hi, I don't think this has anything to do with thread synchronization or JIT+gzip, as a matter of fact.

Why threads aren't the issue: I downloaded the code onto my machine, and the Clojure code always runs slower, whether I read one file or two, with or without gzip. You run the test case using (future), and each instance creates its own InputStream+Reader, so nothing is shared between threads.

Why it isn't JIT+gzip: during my runs the Clojure code was always slower than the Java code, with or without gzip.

The issue: you are not comparing apples with apples. The Java code does not create an intermediate List/Seq structure and then count what is in it, but that is exactly what the Clojure code does: line-seq creates an intermediate structure that can never be JITted away.

In the Java code you are testing how fast you can read lines of text and increment a counter.

In the Clojure code you are testing how fast you can read lines of text, how fast you can create a sequence, and how fast you can count the elements of that sequence by iterating through it.

I've written a Clojure function that does what the Java code does, i.e. counts lines, at https://gist.github.com/gerritjvv/a9b4ab17f9f4d4b6cdb7 (also pasted below), and it runs as fast as the Java code.

The Clojure code is:

    (import '(java.io BufferedReader FileInputStream InputStreamReader)
            '(java.util.zip GZIPInputStream))

    (defn gz-line-counter [^String file-name]
      (let [^BufferedReader reader (BufferedReader.
                                     (InputStreamReader.
                                       (GZIPInputStream.
                                         (FileInputStream. file-name))))]
        (try
          ;; readLine returns nil at end of stream, which ends the loop
          (loop [i 0]
            (if (.readLine reader)
              (recur (inc i))
              i))
          (finally
            (.close reader)))))

My suggestion: use line-seq when you want to treat the lines of a file as a stream of data. If you only want to count, avoid intermediate data structures. If you need brute-force performance and still have to process each line, use loop/recur, but only after trying line-seq first. A lot of the time the cost of the intermediate data structure is not noticeable next to the string processing/reading logic, but that depends on the use case.

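To make the comparison concrete, here is a minimal sketch that puts the two counting styles side by side on the same gzip file. The names gz-reader, count-with-line-seq, count-with-loop and write-sample-gz are made up for this illustration and are not taken from the parallel-gzip repository; count-with-loop follows the same idea as gz-line-counter, and the sample file is generated only so the snippet runs on its own.

```clojure
(import '(java.io BufferedReader BufferedWriter File FileInputStream
                  FileOutputStream InputStreamReader OutputStreamWriter)
        '(java.util.zip GZIPInputStream GZIPOutputStream))

(defn gz-reader
  "Opens a gzip-compressed text file as a BufferedReader (hypothetical helper)."
  ^BufferedReader [^String file-name]
  (BufferedReader.
    (InputStreamReader.
      (GZIPInputStream. (FileInputStream. file-name)))))

(defn count-with-line-seq
  "Counts lines by walking the lazy seq that line-seq builds:
  one seq cell is allocated per line before it is counted."
  [^String file-name]
  (with-open [r (gz-reader file-name)]
    (count (line-seq r))))

(defn count-with-loop
  "Counts lines with a plain counter and no intermediate seq."
  [^String file-name]
  (with-open [r (gz-reader file-name)]
    (loop [i 0]
      (if (.readLine r)
        (recur (inc i))
        i))))

(defn write-sample-gz
  "Writes n short lines to a temporary .gz file and returns its path.
  Exists only so the sketch is self-contained."
  [n]
  (let [f (File/createTempFile "lines" ".gz")]
    (.deleteOnExit f)
    (with-open [w (BufferedWriter.
                    (OutputStreamWriter.
                      (GZIPOutputStream. (FileOutputStream. f))))]
      (dotimes [i n]
        (.write w (str "line " i "\n"))))
    (.getPath f)))

;; Both functions return the same count; only the amount of work differs.
(let [path (write-sample-gz 1000)]
  (println (count-with-line-seq path) (count-with-loop path)))
```

Wrapping each call in (time ...) on a large file should show whether the gap matters for your data; on a small sample like this one the difference is likely to be lost in the noise.
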
On Wednesday, 16 September 2015 07:00:35 UTC+2, Andy L wrote:
>
> Hi,
>
> Thanks for looking into my questions. I posted a self-contained example
> here https://github.com/coreasync/parallel-gzip with instructions on how to
> create test data as well. Also attached below are the results I get on my
> quite decent hardware (partial 'time' results are mangled, was not sure how
> to separate them). I use two separate 'lazy-seq's; however, I heard
> somewhere that they are not free even if no synchronization takes place,
> like in this case, but could be optimized out for a single-thread situation.
> Apologies for jumping to conclusions ... Also, I do not believe that we deal
> with a significant amount of IO, as those test files easily fit into O/S
> buffers.
>
> The two test runs below show that we can easily take advantage of multiple
> cores. The Java versions scale well. Same for the Clojure code with
> uncompressed files. In all 3 cases the JVM takes a stable 200% of CPU, i.e.
> it occupies two cores. Also, the Java and Clojure time numbers are quite
> consistent.
>
> However, as soon as I add a GZIPInputStream, the Clojure version starts
> pegging 400, 500, 600% of CPU, varying over time. I assumed initially that
> the effort was spent on some thread synchronization tasks which the JIT was
> not able to factor out due to more code being involved. Interestingly
> enough, YourKit shows only two threads busy, interlaced with empty spaces,
> almost as if the JVM were busy doing some kind of housekeeping, hitting the
> CPU really badly. Thread dumps did not reveal anything weird, no lock
> contention, etc. ... I tried Java 7 and 8 as well as Clojure 1.7 and 1.8;
> none of them makes any difference.
>
> Understanding where that limitation comes from is quite critical, as I try
> to use the hardware to the best possible extent.
>
> Thanks in advance for hints and clues ...
> AndyL
>
>
> # create test data
> $ curl -o 1 http://norvig.com/big.txt
> $ cat 1 1 1 1 1 1 1 1 > 2
> $ cat 2 2 2 2 2 2 2 2 > 3
> $ cat 3 3 3 3 3 3 3 3 > 4
> $ gzip -k 4
> $ lein run 4
> starting...
>
> uncompressed
> Java code:
> "Elapsed time: 8258.013802 msecs"
> (65769984)
> "Elapsed time: 8268.641987 msecs"
> Clojure code:
> "Elapsed time: 9117.814135 msecs"
> (65769984)
> "Elapsed time: 9118.270526 msecs"
>
> compressed
> Java code:
> "Elapsed time: 21522.20167 msecs"
> (65769984)
> "Elapsed time: 21522.663463 msecs"
> Clojure code:
> "Elapsed time: 21573.585966 msecs"
> (65769984)
> "Elapsed time: 21574.013417 msecs"
> ...finished
> $ lein run 4 4
> starting...
>
> uncompressed
> Java code:
> ""EEllaappsseedd ttiimmee:: 77226688..0857983348 msec1s "m
> secs"
> (65769984 65769984)
> "Elapsed time: 7280.09169 msecs"
> Clojure code:
> ""EEllaappsseedd ttiimmee:: 99117777..113308627362 mmsseeccss""
>
> (65769984 65769984)
> "Elapsed time: 9177.644745 msecs"
>
> compressed
> Java code:
> "Elapsed time: 22324.81872 msecs"
> "Elapsed time: 23122.111874 msecs"
> (65769984 65769984)
> "Elapsed time: 23122.511818 msecs"
> Clojure code:
> "Elapsed time: 75968.051536 msecs"
> "Elapsed time: 76018.787437 msecs"
> (65769984 65769984)
> "Elapsed time: 76019.215303 msecs"
> ...finished