Hi,

As a matter of fact, I do not think this has anything to do with thread 
synchronization or with JIT+gzip.

Why threads aren't the issue:
I've downloaded the code onto my machine, and the Clojure code always runs 
slower, no matter whether I read one or two files, or use gzip or not. 
You run the test case using (future), and each instance creates its own 
InputStream+Reader, which means nothing is shared between threads.

Why not JIT+gzip:
During my runs the Clojure code, with or without gzip, always ran slower 
than the Java code.

Issue:
You are not comparing apples with apples. The Java code does not create any 
intermediate List/Seq structure and then count what is in it, 
but this is exactly what the Clojure code does: line-seq creates an 
intermediate structure that can never be jitted away.

In the Java code you are testing how fast you can read lines of text and 
increment a counter.
In the Clojure code you are testing how fast you can read lines of text, 
how fast you can create a sequence, and how fast you can count the elements 
of that sequence by iterating through it.
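
To make the extra work concrete, here is a minimal sketch of the line-seq 
style of counting. I am assuming the benchmark does something of this shape; 
count-lines-via-seq and string-reader are hypothetical names used only for 
this illustration, and an in-memory reader stands in for the real file.

```clojure
(import '(java.io BufferedReader StringReader))

;; Hypothetical helper: counts lines by first exposing them as a lazy seq.
;; Every line read is wrapped in freshly allocated seq cells, and count
;; then walks those cells one by one.
(defn count-lines-via-seq [^BufferedReader reader]
  (count (line-seq reader)))

;; Hypothetical helper: an in-memory reader so the sketch needs no file.
(defn string-reader [^String s]
  (BufferedReader. (StringReader. s)))

(println (count-lines-via-seq (string-reader "a\nb\nc\n")))
```

The result is the same as with a plain readLine loop; only the number of 
objects allocated along the way differs.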

I've written a Clojure function that does what the Java code does, i.e. 
counts lines, at https://gist.github.com/gerritjvv/a9b4ab17f9f4d4b6cdb7 (also 
pasted below), and this runs as fast as the Java code.

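The Clojure code is:
(import '(java.io BufferedReader FileInputStream InputStreamReader)
        '(java.util.zip GZIPInputStream))

;; Counts lines by calling readLine in a loop; no intermediate seq is built.
(defn gz-line-counter [^String file-name]
  (let [^BufferedReader reader (BufferedReader. (InputStreamReader. (GZIPInputStream. (FileInputStream. file-name))))]
    (try
      (loop [i 0] (if (.readLine reader) (recur (inc i)) i))
      (finally (.close reader)))))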
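
For anyone who wants to try this without the test data, here is a small 
self-contained sketch of the same idea. It is not the code from the gist: it 
uses with-open instead of try/finally, the name gz-line-counter-with-open is 
made up for this example, and the temporary three-line gzip file is only there 
so that it can be run as is.

```clojure
(import '(java.io BufferedReader File FileInputStream FileOutputStream
                  InputStreamReader)
        '(java.util.zip GZIPInputStream GZIPOutputStream))

;; Hypothetical variant: with-open closes the reader when the body exits,
;; whether normally or via an exception.
(defn gz-line-counter-with-open [^String file-name]
  (with-open [^BufferedReader reader
              (BufferedReader.
                (InputStreamReader.
                  (GZIPInputStream. (FileInputStream. file-name))))]
    (loop [i 0]
      (if (.readLine reader)
        (recur (inc i))
        i))))

;; Write a tiny gzip file to a temp location and return its path.
(defn write-sample-gz []
  (let [tmp (File/createTempFile "lines" ".gz")]
    (.deleteOnExit tmp)
    (with-open [out (GZIPOutputStream. (FileOutputStream. tmp))]
      (.write out (.getBytes "one\ntwo\nthree\n" "UTF-8")))
    (.getPath tmp)))

(println (gz-line-counter-with-open (write-sample-gz)))
```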


My suggestion would be: use line-seq when you are interested in 
treating the lines of a file as a stream of data. 
If you are only interested in counting, avoid intermediate data structures.
If you are interested in brute-force performance and still need line 
processing, use loop/recur, but only after having tried line-seq first. 
A lot of the time the cost of the intermediate data structure is not 
noticeable compared to the string processing/reading logic, but this 
depends on each use case.
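
To illustrate that trade-off, here is a rough sketch of the same per-line 
work written both ways. The per-line work (summing line lengths) and all the 
function names are made up for this example, and an in-memory reader stands 
in for a real file.

```clojure
(import '(java.io BufferedReader StringReader))

;; Stream style: the lines are a seq, so map/filter/reduce compose freely.
(defn total-chars-seq [^BufferedReader reader]
  (reduce + (map count (line-seq reader))))

;; Loop style: no intermediate seq, at the price of hand-written iteration.
(defn total-chars-loop [^BufferedReader reader]
  (loop [total 0]
    (if-let [^String line (.readLine reader)]
      (recur (+ total (count line)))
      total)))

;; Hypothetical helper: an in-memory reader so the sketch needs no file.
(defn reader-of [^String s]
  (BufferedReader. (StringReader. s)))

(println (total-chars-seq (reader-of "ab\ncd\nef\n")))
(println (total-chars-loop (reader-of "ab\ncd\nef\n")))
```

Both return the same total, so it is easy to start with the line-seq version 
and switch to the loop only if measurements show it matters.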


On Wednesday, 16 September 2015 07:00:35 UTC+2, Andy L wrote:
>
>
> Hi,
>
> Thanks for looking into my questions. I posted a self-contained example 
> here https://github.com/coreasync/parallel-gzip with instructions on how to 
> create test data as well. Below I have also attached the results I get on my 
> quite decent hardware (partial 'time' results are mangled; I was not sure how 
> to separate them). I use two separate 'lazy-seq's; however, I heard somewhere 
> that they are not free even if no synchronization takes place, as in this 
> case, but that this could be optimized out for a single-thread situation. 
> Apologies for jumping to conclusions ... Also, I do not believe that we are 
> dealing with a significant amount of IO, as those test files easily fit into 
> O/S buffers.
>
> The two test runs below show that we can easily take advantage of multiple 
> cores. The Java versions scale well, and so does the Clojure code for 
> uncompressed files. In all 3 cases the JVM takes a stable 200% of CPU, i.e. 
> it occupies two cores. The Java and Clojure timings are also quite 
> consistent.
>
> However, as soon as I add a GZIPInputStream, the Clojure version 
> starts pegging 400, 500, 600% of CPU, varying over time. I assumed initially 
> that the effort was spent on some thread synchronization tasks that the JIT 
> was not able to factor out because more code is involved. Interestingly 
> enough, YourKit shows only two threads busy, interlaced with empty spaces, 
> almost as if the JVM were busy doing some kind of housekeeping and hitting 
> the CPU really hard. Thread dumps did not reveal anything weird, no lock 
> contention, etc. ... I tried Java 7 and 8 as well as Clojure 1.7 and 1.8; 
> none of them made any difference.
>
> Understanding where that limitation comes from is quite critical, as I am 
> trying to use the hardware to the best possible extent.
>
> Thanks in advance for hints and clues ...
> AndyL
>
>
> # create test data
> $curl -o 1 http://norvig.com/big.txt
> $cat 1 1 1 1 1 1 1 1 > 2
> $cat 2 2 2 2 2 2 2 2 > 3
> $cat 3 3 3 3 3 3 3 3 > 4
> $gzip -k 4
> $lein run 4
> starting...
>
> uncompressed
> Java code:
> "Elapsed time: 8258.013802 msecs"
> (65769984)
> "Elapsed time: 8268.641987 msecs"
> Clojure code:
> "Elapsed time: 9117.814135 msecs"
> (65769984)
> "Elapsed time: 9118.270526 msecs"
>
> compressed
> Java code:
> "Elapsed time: 21522.20167 msecs"
> (65769984)
> "Elapsed time: 21522.663463 msecs"
> Clojure code:
> "Elapsed time: 21573.585966 msecs"
> (65769984)
> "Elapsed time: 21574.013417 msecs"
> ...finished
> $ lein run 4 4
> starting...
>
> uncompressed
> Java code:
> ""EEllaappsseedd  ttiimmee::  77226688..0857983348 msec1s "m
> secs"
> (65769984 65769984)
> "Elapsed time: 7280.09169 msecs"
> Clojure code:
> ""EEllaappsseedd  ttiimmee::  99117777..113308627362  mmsseeccss""
>
> (65769984 65769984)
> "Elapsed time: 9177.644745 msecs"
>
> compressed
> Java code:
> "Elapsed time: 22324.81872 msecs"
> "Elapsed time: 23122.111874 msecs"
> (65769984 65769984)
> "Elapsed time: 23122.511818 msecs"
> Clojure code:
> "Elapsed time: 75968.051536 msecs"
> "Elapsed time: 76018.787437 msecs"
> (65769984 65769984)
> "Elapsed time: 76019.215303 msecs"
> ...finished
>
>
>
>

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en