[ https://issues.apache.org/jira/browse/KAFKA-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906407#comment-13906407 ]
Jay Kreps commented on KAFKA-1253:
----------------------------------

This will be tricky but is possible. Here are a couple of pointers:

1. ByteBuffer.array() gives the backing array of a heap ByteBuffer, so we can work with APIs that only accept arrays.

2. GZIPOutputStream requires a stream. Two options:
   a. Make an OutputStream implementation based on ByteBuffer. ByteArrayOutputStream would work, but it will be tricky: you would have to do new ByteArrayOutputStream(size), then call toByteArray() to get the bytes (which returns a copy rather than the backing array), and use ByteBuffer.wrap() on that array to create the ByteBuffer.
   b. Directly use the Deflater compression code Java provides, which is what GZIPOutputStream uses under the covers. This is a better API, but there are some subtle differences that GZIPOutputStream hacks around, and we would have to do similar hacking.

3. There are two snappy libraries: we currently use the JNI wrapper for the Google native code, but there is also a pure Java implementation. Ideally, either way, snappy should not be a runtime dependency unless you enable snappy compression. This means not instantiating the classes in the snappy jar unless they are needed.

4. The desired end result is that our performance on compressed messages is comparable to the underlying compression codec and not artificially limited by lots and lots of byte copying (e.g. see http://grokbase.com/t/kafka/users/1383bcfkym/compression-performance). For example, snappy claims performance on the order of hundreds of MB/sec. So it would be good to make a stand-alone main method that runs the message compression to create compressed messages, benchmark the performance, and look at it in hprof to ensure the time is actually going to compression. This performance will be particularly important on the server side, where we need to both decompress and recompress and where compression is a big bottleneck.
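To make option 2a concrete, here is a minimal sketch of an OutputStream that writes straight into a heap ByteBuffer, with GZIPOutputStream layered on top. The class names (ByteBufferGzipSketch, ByteBufferOutputStream) and the doubling growth policy are assumptions for illustration only, not code from the Kafka tree. Growing the buffer still costs one copy per resize, so sizing it well up front matters.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class ByteBufferGzipSketch {

    /** Hypothetical stream that writes into a heap ByteBuffer and grows it when full. */
    static final class ByteBufferOutputStream extends OutputStream {
        private ByteBuffer buffer;

        ByteBufferOutputStream(int initialCapacity) {
            this.buffer = ByteBuffer.allocate(initialCapacity);
        }

        @Override
        public void write(int b) {
            ensureRemaining(1);
            buffer.put((byte) b);
        }

        @Override
        public void write(byte[] bytes, int off, int len) {
            ensureRemaining(len);
            buffer.put(bytes, off, len);
        }

        ByteBuffer buffer() {
            return buffer;
        }

        // Assumed growth policy: at least double, copying what was written so far.
        private void ensureRemaining(int needed) {
            if (buffer.remaining() >= needed) {
                return;
            }
            int newCapacity = Math.max(buffer.capacity() * 2, buffer.position() + needed);
            ByteBuffer bigger = ByteBuffer.allocate(newCapacity);
            buffer.flip();
            bigger.put(buffer);
            buffer = bigger;
        }
    }

    /** Compresses the payload and returns a buffer flipped and ready for reading. */
    static ByteBuffer gzip(byte[] payload, int initialCapacity) {
        ByteBufferOutputStream out = new ByteBufferOutputStream(initialCapacity);
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        ByteBuffer result = out.buffer();
        result.flip();
        return result;
    }

    /** Decompresses by reading the buffer's backing array in place (heap buffers only). */
    static byte[] gunzip(ByteBuffer compressed) {
        ByteArrayInputStream in = new ByteArrayInputStream(
                compressed.array(),
                compressed.arrayOffset() + compressed.position(),
                compressed.remaining());
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(in)) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] payload = "hello kafka hello kafka hello kafka".getBytes(StandardCharsets.UTF_8);
        ByteBuffer compressed = gzip(payload, 16);
        System.out.println("compressed bytes: " + compressed.remaining());
        System.out.println(new String(gunzip(compressed), StandardCharsets.UTF_8));
    }
}
```

The point of the custom stream is that the compressed bytes land directly in the buffer that will be handed on, instead of going through ByteArrayOutputStream and an extra toByteArray() copy.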
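For option 2b, a rough sketch of driving java.util.zip.Deflater directly and letting it write into a ByteBuffer's backing array. The class and method names are hypothetical. Note that a plain Deflater as used here emits the zlib format, not gzip: GZIPOutputStream additionally writes its own header and a CRC32/length trailer around raw deflate data, which is presumably the kind of extra handling a gzip-compatible version of this would need.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class DeflaterIntoByteBufferSketch {

    /** Compresses payload (zlib format) directly into a heap ByteBuffer's backing array. */
    static ByteBuffer deflate(byte[] payload, int initialCapacity) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        ByteBuffer out = ByteBuffer.allocate(Math.max(initialCapacity, 16));
        try {
            deflater.setInput(payload);
            deflater.finish();
            while (!deflater.finished()) {
                if (!out.hasRemaining()) {
                    out = grow(out);
                }
                // Write into the free region of the backing array, then advance position.
                int written = deflater.deflate(
                        out.array(), out.arrayOffset() + out.position(), out.remaining());
                out.position(out.position() + written);
            }
        } finally {
            deflater.end(); // release native zlib resources
        }
        out.flip();
        return out;
    }

    // Assumed growth policy: double the capacity and copy what was written so far.
    private static ByteBuffer grow(ByteBuffer full) {
        ByteBuffer bigger = ByteBuffer.allocate(full.capacity() * 2);
        full.flip();
        bigger.put(full);
        return bigger;
    }

    /** Inflates back to the original bytes; the caller must know the original length. */
    static byte[] inflate(ByteBuffer compressed, int originalLength) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(
                    compressed.array(),
                    compressed.arrayOffset() + compressed.position(),
                    compressed.remaining());
            byte[] result = new byte[originalLength];
            int off = 0;
            while (off < originalLength && !inflater.finished()) {
                int n = inflater.inflate(result, off, originalLength - off);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unexpected input
                }
                off += n;
            }
            if (off != originalLength) {
                throw new IllegalStateException("expected " + originalLength + " bytes, got " + off);
            }
            return result;
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] payload = "hello kafka hello kafka hello kafka".getBytes(StandardCharsets.UTF_8);
        ByteBuffer compressed = deflate(payload, 64);
        byte[] restored = inflate(compressed, payload.length);
        System.out.println(new String(restored, StandardCharsets.UTF_8));
    }
}
```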
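One possible way to keep snappy out of the runtime dependencies (point 3) is to reference its classes only by name and load them reflectively when the snappy codec is actually selected. The sketch below assumes the snappy-java JNI wrapper, whose stream class is org.xerial.snappy.SnappyOutputStream; the Codec enum and the wrap() helper are made up for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

final class LazyCodecSketch {

    /** Hypothetical codec selector; not the actual Kafka enum. */
    enum Codec { NONE, GZIP, SNAPPY }

    /** Wraps the stream for the chosen codec, touching snappy classes only for SNAPPY. */
    static OutputStream wrap(Codec codec, OutputStream out) {
        switch (codec) {
            case NONE:
                return out;
            case GZIP:
                try {
                    return new GZIPOutputStream(out);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            case SNAPPY:
                try {
                    // Loaded by name so this class has no link-time reference to the snappy jar.
                    Class<?> cls = Class.forName("org.xerial.snappy.SnappyOutputStream");
                    return (OutputStream) cls.getConstructor(OutputStream.class).newInstance(out);
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException(
                            "snappy compression requested but the snappy jar is not usable", e);
                }
            default:
                throw new IllegalArgumentException("unknown codec " + codec);
        }
    }
}
```

With this shape, a producer configured for NONE or GZIP never triggers loading of anything in the snappy jar, so the jar can be absent from the classpath.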
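As a starting point for the stand-alone benchmark in point 4, something along these lines could work. It only measures java.util.zip's GZIPOutputStream over a synthetic payload, not the real message-set code path, and the class name, payload generator, and iteration counts are all assumptions. A real version would swap in the producer's compression path and be run under hprof to confirm where the time goes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

final class CompressionBenchmarkSketch {

    /** Builds a moderately compressible payload from a small alphabet (synthetic data). */
    static byte[] samplePayload(int size, long seed) {
        Random random = new Random(seed);
        byte[] payload = new byte[size];
        for (int i = 0; i < size; i++) {
            payload[i] = (byte) ('a' + random.nextInt(8));
        }
        return payload;
    }

    /** Compresses once and returns the compressed size in bytes. */
    static int gzipSize(byte[] payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(payload.length);
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.size();
    }

    /** Returns uncompressed input throughput in MB/sec over the timed iterations. */
    static double megabytesPerSecond(byte[] payload, int iterations) {
        long sink = 0;
        // Short warm-up so the JIT has a chance to compile the hot path.
        for (int i = 0; i < Math.min(iterations, 10); i++) {
            sink += gzipSize(payload);
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += gzipSize(payload);
        }
        long elapsedNanos = Math.max(1, System.nanoTime() - start);
        if (sink == 0) {
            throw new IllegalStateException("compression produced no output");
        }
        double megabytes = (double) payload.length * iterations / (1024.0 * 1024.0);
        return megabytes / (elapsedNanos / 1e9);
    }

    public static void main(String[] args) {
        byte[] payload = samplePayload(1024 * 1024, 42L);
        double rate = megabytesPerSecond(payload, 50);
        System.out.printf("gzip throughput: %.1f MB/sec%n", rate);
    }
}
```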
> Implement compression in new producer
> -------------------------------------
>
>                 Key: KAFKA-1253
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1253
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: producer
>            Reporter: Jay Kreps
>

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)