[ 
https://issues.apache.org/jira/browse/KAFKA-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906407#comment-13906407
 ] 

Jay Kreps commented on KAFKA-1253:
----------------------------------

This will be tricky but is possible.

Here are a couple pointers:
1. ByteBuffer.array() returns the backing array of a heap ByteBuffer, so we 
can work with APIs that only accept arrays.
2. GZIPOutputStream requires a stream. Two options:
a. Make an OutputStream implementation based on ByteBuffer. 
ByteArrayOutputStream would work but it will be tricky because you would have 
to do new ByteArrayOutputStream(size), then call toByteArray() to get the 
bytes (note that this returns a copy, not the backing array) and use 
ByteBuffer.wrap() on that array to create the ByteBuffer.
b. Directly use the Deflater compression code Java provides, which is what 
GZIPOutputStream uses under the covers. This is a better API, but there are 
some subtle differences that GZIPOutputStream hacks around, and we would have 
to do similar hacking.
3. There are two snappy libraries: we currently use the JNI wrapper for the 
Google native code, but there is also a pure Java implementation. Ideally, 
either way, snappy should not be a runtime dependency unless you enable snappy 
compression. This will mean not instantiating the classes in the snappy jar 
unless they are needed.
4. The desired end result here is that our performance on compressed messages 
is comparable to the underlying compression codec and not artificially limited 
by lots and lots of byte copying (e.g. see 
http://grokbase.com/t/kafka/users/1383bcfkym/compression-performance). For 
example, snappy claims performance on the order of hundreds of MB/sec. So it 
would be good to make a stand-alone main method that runs the message 
compression to create compressed messages, benchmark the performance, and 
look at it in hprof to ensure the time is actually going to compression. 
This performance will be particularly important on the server side, where we 
need to both decompress and recompress and where compression is a big 
bottleneck.
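
To make pointers 1 and 2a concrete, here is a minimal sketch, assuming a 
fixed-size heap ByteBuffer that is large enough for the compressed output. 
ByteBufferGzipSketch, ByteBufferOutputStream, compressInto and decompress are 
hypothetical names for illustration only, not existing Kafka classes, and 
this is not meant as the final design.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class ByteBufferGzipSketch {

    /** Hypothetical helper: an OutputStream that writes straight into a ByteBuffer. */
    static final class ByteBufferOutputStream extends OutputStream {
        private final ByteBuffer buffer;

        ByteBufferOutputStream(ByteBuffer buffer) {
            this.buffer = buffer;
        }

        @Override
        public void write(int b) {
            buffer.put((byte) b);
        }

        @Override
        public void write(byte[] bytes, int off, int len) {
            // Throws BufferOverflowException if the buffer is too small (assumption:
            // the caller sized it generously; a real version might grow the buffer).
            buffer.put(bytes, off, len);
        }
    }

    /** Compresses payload directly into target, with no intermediate byte[] copy of the output. */
    static void compressInto(ByteBuffer target, byte[] payload) throws IOException {
        try (GZIPOutputStream gzip = new GZIPOutputStream(new ByteBufferOutputStream(target))) {
            gzip.write(payload);
        } // close() finishes the gzip stream and writes the trailer
    }

    /** Decompresses a heap buffer by handing its backing array to an array-only API. */
    static byte[] decompress(ByteBuffer compressed) throws IOException {
        // array() only works for heap buffers (hasArray() == true).
        ByteArrayInputStream in = new ByteArrayInputStream(
                compressed.array(),
                compressed.arrayOffset() + compressed.position(),
                compressed.remaining());
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(in)) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("some repetitive message payload ");
        }
        byte[] payload = sb.toString().getBytes(StandardCharsets.UTF_8);

        ByteBuffer buffer = ByteBuffer.allocate(64 * 1024);
        compressInto(buffer, payload);
        buffer.flip(); // switch from writing to reading

        System.out.println("raw bytes: " + payload.length
                + ", compressed bytes: " + buffer.remaining()
                + ", round trip ok: " + java.util.Arrays.equals(payload, decompress(buffer)));
    }
}
```

The same main method could be extended with a timing loop to serve as the 
stand-alone benchmark mentioned in pointer 4.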

> Implement compression in new producer
> -------------------------------------
>
>                 Key: KAFKA-1253
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1253
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: producer 
>            Reporter: Jay Kreps
>




--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
