[ https://issues.apache.org/jira/browse/KAFKA-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15126346#comment-15126346 ]
Ismael Juma edited comment on KAFKA-3174 at 2/1/16 3:02 PM:
------------------------------------------------------------

[~becket_qin] We started recommending Java 8 around the same time we released 0.9.0.0 (the documentation also mentions that LinkedIn is using Java 8): http://kafka.apache.org/documentation.html#java

I did some investigation to understand the specifics of the CRC32 improvement in the JDK. It relies on SSE2, SSE 4.1, AVX and CLMUL. SSE has been available for a long time, CLMUL since Intel Westmere (2010) and AVX since Intel Sandy Bridge (2011). It's probably OK to assume that these instructions will be available to those who are constrained by CPU performance. Note that this does not use the CRC32 CPU instruction, as we would have to use CRC32C for that (see KAFKA-1449 for more details on what is possible if we are willing to support CRC32C).

I wrote a simple JMH benchmark: https://gist.github.com/ijuma/f86ad935715cfd4e258e I tested it on my Ivy Bridge MacBook on JDK 7 update 80 and JDK 8 update 76, configuring JMH to use 10 one-second measurement iterations, 10 one-second warmup iterations and 1 fork.
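For reference, the JDK side of what such a benchmark exercises is just `java.util.zip.CRC32`. This is a minimal sketch (not the gist's code; `Crc32Demo` is a made-up name) showing the two call patterns that come up below: a single byte-array update versus several incremental updates. Both produce the same checksum; only the performance characteristics differ.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class Crc32Demo {
    public static void main(String[] args) {
        byte[] bytes = "123456789".getBytes(StandardCharsets.US_ASCII);

        // One-shot update over the whole array: the call that the JDK 8
        // intrinsic accelerates for large inputs.
        CRC32 oneShot = new CRC32();
        oneShot.update(bytes, 0, bytes.length);

        // Incremental updates: many small calls against the same checksum state.
        CRC32 incremental = new CRC32();
        for (byte b : bytes) {
            incremental.update(b);
        }

        // "123456789" is the standard CRC-32 check input; both paths yield 0xCBF43926.
        System.out.println(Long.toHexString(oneShot.getValue()));          // cbf43926
        System.out.println(oneShot.getValue() == incremental.getValue());  // true
    }
}
```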
JDK 8 update 76 results:
{code}
[info] Benchmark               (bytesSize)  Mode  Cnt       Score       Error  Units
[info] Crc32Bench.jdkCrc32               8  avgt   10      24.902 ±     0.728  ns/op
[info] Crc32Bench.jdkCrc32              16  avgt   10      48.819 ±     2.550  ns/op
[info] Crc32Bench.jdkCrc32              32  avgt   10      83.434 ±     2.668  ns/op
[info] Crc32Bench.jdkCrc32             128  avgt   10     127.679 ±     5.185  ns/op
[info] Crc32Bench.jdkCrc32            1024  avgt   10     450.105 ±    18.943  ns/op
[info] Crc32Bench.jdkCrc32           65536  avgt   10   25579.406 ±   683.017  ns/op
[info] Crc32Bench.jdkCrc32         1048576  avgt   10  408708.242 ± 12183.543  ns/op
[info] Crc32Bench.kafkaCrc32             8  avgt   10      14.761 ±     0.647  ns/op
[info] Crc32Bench.kafkaCrc32            16  avgt   10      19.114 ±     0.423  ns/op
[info] Crc32Bench.kafkaCrc32            32  avgt   10      34.243 ±     1.066  ns/op
[info] Crc32Bench.kafkaCrc32           128  avgt   10     114.481 ±     2.812  ns/op
[info] Crc32Bench.kafkaCrc32          1024  avgt   10     835.630 ±    22.412  ns/op
[info] Crc32Bench.kafkaCrc32         65536  avgt   10   52234.713 ±  2229.624  ns/op
[info] Crc32Bench.kafkaCrc32       1048576  avgt   10  822903.613 ± 20950.560  ns/op
{code}
JDK 7 update 80 results:
{code}
[info] Benchmark               (bytesSize)  Mode  Cnt       Score       Error  Units
[info] Crc32Bench.jdkCrc32               8  avgt   10     114.802 ±     8.289  ns/op
[info] Crc32Bench.jdkCrc32              16  avgt   10     122.030 ±     3.153  ns/op
[info] Crc32Bench.jdkCrc32              32  avgt   10     131.082 ±     5.501  ns/op
[info] Crc32Bench.jdkCrc32             128  avgt   10     154.116 ±     6.164  ns/op
[info] Crc32Bench.jdkCrc32            1024  avgt   10     512.151 ±    15.934  ns/op
[info] Crc32Bench.jdkCrc32           65536  avgt   10   25460.014 ±  1532.627  ns/op
[info] Crc32Bench.jdkCrc32         1048576  avgt   10  401996.290 ± 18606.012  ns/op
[info] Crc32Bench.kafkaCrc32             8  avgt   10      14.493 ±     0.494  ns/op
[info] Crc32Bench.kafkaCrc32            16  avgt   10      20.329 ±     2.019  ns/op
[info] Crc32Bench.kafkaCrc32            32  avgt   10      37.706 ±     0.338  ns/op
[info] Crc32Bench.kafkaCrc32           128  avgt   10     124.197 ±     6.368  ns/op
[info] Crc32Bench.kafkaCrc32          1024  avgt   10     908.327 ±    32.487  ns/op
[info] Crc32Bench.kafkaCrc32         65536  avgt   10   57000.705 ±  2976.852  ns/op
[info] Crc32Bench.kafkaCrc32       1048576  avgt   10  940433.528 ± 26257.962  ns/op
{code}
Using a VM
intrinsic avoids JNI set-up costs, making JDK 8 much faster than JDK 7 for small byte arrays. Having said that, Kafka's pure Java implementation is still faster for byte arrays of up to 128 bytes according to this benchmark. Surprisingly, the results for larger byte arrays are similar on JDK 7 and JDK 8. I had a quick look at the assembly generated for JDK 8 and it seems to use AVX and CLMUL, as per the OpenJDK commit I linked to. Unfortunately, it's a bit more work to look at the assembly generated by JDK 7 on a Mac, so I didn't. More investigation would be required to understand why this is so (and to be able to trust the numbers).

Looking at how we compute CRCs in `Record`, there are two different code paths depending on whether we call it from `Compressor` or not. The former invokes the Crc32 update methods several times (both the byte array and int versions) while the latter invokes the byte array version only once. To really understand the impact of this change, I think we need to benchmark the producer with varying message sizes using both implementations. [~becket_qin], how did you come up with the 2x as fast figure?

> Re-evaluate the CRC32 class performance.
> ----------------------------------------
>
>                 Key: KAFKA-3174
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3174
>             Project: Kafka
>          Issue Type: Improvement
> Affects Versions: 0.9.0.0
>         Reporter: Jiangjie Qin
>         Assignee: Jiangjie Qin
>          Fix For: 0.9.0.1
>
>
> We used org.apache.kafka.common.utils.CRC32 in clients because it has better
> performance than java.util.zip.CRC32 in Java 1.6.
> In a recent test I ran, it looks like in Java 1.8 the CRC32 class is 2x as fast
> as the Crc32 class we are using now. We may want to re-evaluate the performance
> of the Crc32 class and see if it makes sense to simply use the Java CRC32 instead.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)