[ https://issues.apache.org/jira/browse/KAFKA-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998466#comment-13998466 ]
Albert Strasheim edited comment on KAFKA-1449 at 5/15/14 5:50 AM:
------------------------------------------------------------------

The Attributes field in Message has 6 bits to play with. Maybe if the 3rd bit is set, it could mean that the Crc field is a CRC32C instead of a CRC32.

New producers producing messages for new consumers could choose to set the 3rd bit and use CRC32C instead of CRC32. New consumers will check for the bit being set and verify the checksum as CRC32C instead of CRC32.

A new producer for a stream with old consumers should produce CRC32 messages. If it doesn't, old consumers that don't check for the 3rd bit will verify the CRC32C checksum as a CRC32, which should fail; that seems like a good enough outcome. Old consumers can't read data from new producers that choose to use the new checksum.

Old producers continue to use CRC32, and new consumers will continue to verify with CRC32, since the 3rd bit should be zero according to the protocol spec; this will be slow. To go fast, upgrade your producer code first and start using CRC32C there.

If Kafka does any internal verification, it will have to be updated to check for the 3rd bit being set and verify the checksum accordingly. Did I miss anything?

P.S. This is probably a separate JIRA, but for compression, it would be nice to support LZ4 too. LZ4 and LZ4HC have compression speeds similar to Snappy, but much better decompression speeds. Benchmarks here: https://code.google.com/p/lz4/

Keeping this in mind, maybe the 3rd bit should also be reserved for evolving the available compression algorithms, and the 4th bit can be used for CRC32/CRC32C?

> Extend wire protocol to allow CRC32C
> ------------------------------------
>
>                 Key: KAFKA-1449
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1449
>             Project: Kafka
>          Issue Type: Improvement
>          Components: consumer
>            Reporter: Albert Strasheim
>            Assignee: Neha Narkhede
>             Fix For: 0.9.0
>
>
> Howdy
> We are currently building out a number of Kafka consumers in Go, based on a
> patched version of the Sarama library that Shopify released a while back.
> We have a reasonably fast serialization protocol (Cap'n Proto), a 10G network
> and lots of cores. We have various consumers computing all kinds of
> aggregates on a reasonably high volume access log stream (1.1e6 messages/sec
> peak, about 500-600 bytes per message uncompressed).
> When profiling our consumer, our single hottest function (until we disabled
> it) was the CRC32 checksum validation, since the deserialization and
> aggregation in these consumers is pretty cheap.
> We believe things could be improved by extending the wire protocol to support
> CRC-32C (Castagnoli), since SSE 4.2 has an instruction to accelerate its
> calculation.
> https://en.wikipedia.org/wiki/SSE4#SSE4.2
> It might be hard to use from Java, but consumers written in most other
> languages will benefit a lot.
> To give you an idea, here are some benchmarks for the Go CRC32 functions
> running on an Intel(R) Core(TM) i7-3540M CPU @ 3.00GHz core:
> BenchmarkCrc32KB             90196 ns/op    363.30 MB/s
> BenchmarkCrcCastagnoli32KB    3404 ns/op   9624.42 MB/s
> I believe BenchmarkCrc32 written in C would do about 600-700 MB/sec, and the
> CRC32-C speed should be close to what one achieves in Go.
> (Met Todd and Clark at the meetup last night. Thanks for the great
> presentation!)

--
This message was sent by Atlassian JIRA
(v6.2#6252)