[ 
https://issues.apache.org/jira/browse/KAFKA-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15126346#comment-15126346
 ] 

Ismael Juma edited comment on KAFKA-3174 at 2/1/16 3:02 PM:
------------------------------------------------------------

[~becket_qin] We started recommending Java 8 around the time we released 
0.9.0.0 (the documentation also mentions that LinkedIn is using Java 8):

http://kafka.apache.org/documentation.html#java

I did some investigation so that we understand the specifics of the improvement 
to CRC32 in the JDK. It relies on SSE 2, SSE 4.1, AVX and CLMUL. SSE has been 
available for a long time, CLMUL since Intel Westmere (2010) and AVX since 
Intel Sandy Bridge (2011). It's probably OK to assume that these instructions 
will be available for those who are constrained by CPU performance.

Note that this does not use the CRC32 CPU instruction, as we would have to 
use CRC32C for that (see KAFKA-1449 for more details on what is possible if we 
are willing to support CRC32C).
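
For reference, here is a minimal sketch of how the two checksums differ on the same input. It assumes a JDK that ships java.util.zip.CRC32C (added in Java 9, so not available on the JDK 7 and JDK 8 builds discussed here). The class name ChecksumCompare and the helper method are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CRC32C;
import java.util.zip.Checksum;

// Hypothetical demo class, not part of Kafka.
class ChecksumCompare {
    // Feed the whole array to the given checksum and return its value.
    static long checksum(Checksum c, byte[] data) {
        c.update(data, 0, data.length);
        return c.getValue();
    }

    public static void main(String[] args) {
        // "123456789" is the conventional check input for CRC algorithms.
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        System.out.printf("CRC32  = %08x%n", checksum(new CRC32(), data));  // cbf43926
        System.out.printf("CRC32C = %08x%n", checksum(new CRC32C(), data)); // e3069283
    }
}
```

The two polynomials produce different values for the same bytes, so the two checksums are not interchangeable for data that has already been written.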

I wrote a simple JMH benchmark:

https://gist.github.com/ijuma/f86ad935715cfd4e258e

I tested it on my Ivy Bridge MacBook on JDK 7 update 80 and JDK 8 update 76, 
configuring JMH to use 10 one-second measurement iterations, 10 one-second 
warmup iterations and 1 fork.

JDK 8 update 76 results:

{code}
[info] Benchmark              (bytesSize)  Mode  Cnt       Score       Error  Units
[info] Crc32Bench.jdkCrc32              8  avgt   10      24.902 ±     0.728  ns/op
[info] Crc32Bench.jdkCrc32             16  avgt   10      48.819 ±     2.550  ns/op
[info] Crc32Bench.jdkCrc32             32  avgt   10      83.434 ±     2.668  ns/op
[info] Crc32Bench.jdkCrc32            128  avgt   10     127.679 ±     5.185  ns/op
[info] Crc32Bench.jdkCrc32           1024  avgt   10     450.105 ±    18.943  ns/op
[info] Crc32Bench.jdkCrc32          65536  avgt   10   25579.406 ±   683.017  ns/op
[info] Crc32Bench.jdkCrc32        1048576  avgt   10  408708.242 ± 12183.543  ns/op
[info] Crc32Bench.kafkaCrc32            8  avgt   10      14.761 ±     0.647  ns/op
[info] Crc32Bench.kafkaCrc32           16  avgt   10      19.114 ±     0.423  ns/op
[info] Crc32Bench.kafkaCrc32           32  avgt   10      34.243 ±     1.066  ns/op
[info] Crc32Bench.kafkaCrc32          128  avgt   10     114.481 ±     2.812  ns/op
[info] Crc32Bench.kafkaCrc32         1024  avgt   10     835.630 ±    22.412  ns/op
[info] Crc32Bench.kafkaCrc32        65536  avgt   10   52234.713 ±  2229.624  ns/op
[info] Crc32Bench.kafkaCrc32      1048576  avgt   10  822903.613 ± 20950.560  ns/op
{code}

JDK 7 update 80 results:

{code}
[info] Benchmark              (bytesSize)  Mode  Cnt       Score       Error  Units
[info] Crc32Bench.jdkCrc32              8  avgt   10     114.802 ±     8.289  ns/op
[info] Crc32Bench.jdkCrc32             16  avgt   10     122.030 ±     3.153  ns/op
[info] Crc32Bench.jdkCrc32             32  avgt   10     131.082 ±     5.501  ns/op
[info] Crc32Bench.jdkCrc32            128  avgt   10     154.116 ±     6.164  ns/op
[info] Crc32Bench.jdkCrc32           1024  avgt   10     512.151 ±    15.934  ns/op
[info] Crc32Bench.jdkCrc32          65536  avgt   10   25460.014 ±  1532.627  ns/op
[info] Crc32Bench.jdkCrc32        1048576  avgt   10  401996.290 ± 18606.012  ns/op
[info] Crc32Bench.kafkaCrc32            8  avgt   10      14.493 ±     0.494  ns/op
[info] Crc32Bench.kafkaCrc32           16  avgt   10      20.329 ±     2.019  ns/op
[info] Crc32Bench.kafkaCrc32           32  avgt   10      37.706 ±     0.338  ns/op
[info] Crc32Bench.kafkaCrc32          128  avgt   10     124.197 ±     6.368  ns/op
[info] Crc32Bench.kafkaCrc32         1024  avgt   10     908.327 ±    32.487  ns/op
[info] Crc32Bench.kafkaCrc32        65536  avgt   10   57000.705 ±  2976.852  ns/op
[info] Crc32Bench.kafkaCrc32      1048576  avgt   10  940433.528 ± 26257.962  ns/op
{code}

Using a VM intrinsic avoids JNI set-up costs, making JDK 8 much faster than 
JDK 7 for small byte arrays. Having said that, Kafka's pure Java 
implementation is still faster for byte arrays of up to 128 bytes according to 
this benchmark. Surprisingly, the results are similar for JDK 7 and JDK 8 for 
larger byte arrays. I had a quick look at the assembly generated for JDK 8 and 
it seems to use AVX and CLMUL as per the OpenJDK commit I linked to. 
Unfortunately, it's a bit more work to look at the assembly generated by JDK 7 
on a Mac and so I didn't. More investigation would be required to understand 
why the large-array results are so similar (and to be able to trust the 
numbers).
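
To make "pure Java implementation" concrete, below is a minimal sketch of a byte-at-a-time, table-driven CRC-32, checked against java.util.zip.CRC32. This is not Kafka's Crc32 class; it is a simpler textbook variant, and the name SimpleCrc32 is hypothetical. It only illustrates why such code has no native call overhead for small inputs.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical textbook implementation, not Kafka's Crc32.
class SimpleCrc32 {
    // Lookup table for the reflected CRC-32 polynomial 0xEDB88320.
    private static final int[] TABLE = new int[256];

    static {
        for (int n = 0; n < 256; n++) {
            int c = n;
            for (int k = 0; k < 8; k++) {
                c = (c & 1) != 0 ? (c >>> 1) ^ 0xEDB88320 : c >>> 1;
            }
            TABLE[n] = c;
        }
    }

    // Compute the CRC-32 of the whole array, one byte per table lookup.
    static long crc32(byte[] data) {
        int crc = 0xFFFFFFFF;
        for (byte b : data) {
            crc = TABLE[(crc ^ b) & 0xFF] ^ (crc >>> 8);
        }
        return (~crc) & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        CRC32 jdk = new CRC32();
        jdk.update(data, 0, data.length);
        // Both implementations should agree on the same input.
        System.out.println(crc32(data) == jdk.getValue()); // prints "true"
    }
}
```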

Looking at how we compute CRCs in `Record`, there are two different code paths 
depending on whether we call it from `Compressor` or not. The former invokes 
the `Crc32` update methods several times (both the byte array and int 
versions) while the latter invokes the byte array version only once.
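
The two call patterns can be sketched as follows, using java.util.zip.CRC32 as a stand-in. The class CrcPaths and the helper updateInt are hypothetical; the helper assumes the int version feeds the four bytes of the value in big-endian order. Both paths should yield the same checksum, so any performance difference comes from the number and size of the update calls.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical demo class, not Kafka's Record or Compressor code.
class CrcPaths {
    // Assumed behaviour of an "int version": four single-byte updates, big-endian.
    static void updateInt(CRC32 crc, int value) {
        crc.update(value >>> 24); // update(int) uses only the low eight bits
        crc.update(value >>> 16);
        crc.update(value >>> 8);
        crc.update(value);
    }

    // Path with a single byte array update over the whole record.
    static long oneShot(byte[] record) {
        CRC32 crc = new CRC32();
        crc.update(record, 0, record.length);
        return crc.getValue();
    }

    // Path with several updates: an int field followed by the payload bytes.
    static long incremental(int header, byte[] payload) {
        CRC32 crc = new CRC32();
        updateInt(crc, header);
        crc.update(payload, 0, payload.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        int header = 0x01020304;
        byte[] payload = "hello kafka".getBytes(StandardCharsets.UTF_8);
        // ByteBuffer writes ints in big-endian order by default.
        byte[] record = ByteBuffer.allocate(4 + payload.length)
                .putInt(header).put(payload).array();
        System.out.println(oneShot(record) == incremental(header, payload)); // prints "true"
    }
}
```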

To really understand the impact of this change, I think we need to benchmark 
the producer with varying message sizes with both implementations. 
[~becket_qin], how did you come up with the 2x as fast figure?


> Re-evaluate the CRC32 class performance.
> ----------------------------------------
>
>                 Key: KAFKA-3174
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3174
>             Project: Kafka
>          Issue Type: Improvement
>    Affects Versions: 0.9.0.0
>            Reporter: Jiangjie Qin
>            Assignee: Jiangjie Qin
>             Fix For: 0.9.0.1
>
>
> We used org.apache.kafka.common.utils.Crc32 in clients because it has better 
> performance than java.util.zip.CRC32 in Java 1.6.
> In a recent test I ran, it looks like in Java 1.8 the CRC32 class is 2x as 
> fast as the Crc32 class we are using now. We may want to re-evaluate the 
> performance of the Crc32 class and see if it makes sense to simply use Java's 
> CRC32 instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
