[ https://issues.apache.org/jira/browse/KAFKA-527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14351177#comment-14351177 ]
Guozhang Wang commented on KAFKA-527:
-------------------------------------

Thanks for the patch, this is very promising. There are a couple of issues we want to resolve here:

1. ByteArrayOutputStream copies data upon overflowing and resizing.
2. The compressed stream needs one extra copy upon finishing reading / writing.

This patch is mainly aimed at #1 above, and I have uploaded a patch that optimizes the decompressed iterator, just as an example for resolving #2.

In addition, I think in the end we will deprecate ByteBufferMessageSet and move to o.a.k.c.r.MemoryRecords, which will resolve both points above. We can discuss whether we want to incorporate these patches into ByteBufferMessageSet now or just wait for the migration and improve on o.a.k.c.r.MemoryRecords. For example, today MemoryRecords's write pattern only supports appending messages with a pre-defined "records batch size", and it tries to close the batch when that size is approached; in ByteBufferMessageSet.create() we are given a set of messages without a predetermined batch size, but it is still possible to get the value from the estimated compression ratio as we do in Compressor, such that in the worst case only one or two buffer expansions (i.e. data copies) are needed. This is just an alternative to the linked-list buffers proposed in this patch.

> Compression support does numerous byte copies
> ---------------------------------------------
>
>                 Key: KAFKA-527
>                 URL: https://issues.apache.org/jira/browse/KAFKA-527
>             Project: Kafka
>          Issue Type: Bug
>          Components: compression
>            Reporter: Jay Kreps
>            Assignee: Yasuhiro Matsuda
>            Priority: Critical
>         Attachments: KAFKA-527.message-copy.history, KAFKA-527.patch,
> java.hprof.no-compression.txt, java.hprof.snappy.text
>
>
> The data path for compressing or decompressing messages is extremely
> inefficient. We do something like 7 (?) complete copies of the data, often
> for simple things like adding a 4 byte size to the front. I am not sure how
> this went by unnoticed.
> This is likely the root cause of the performance issues we saw in doing bulk
> recompression of data in mirror maker.
> The mismatch between the InputStream and OutputStream interfaces and the
> Message/MessageSet interfaces which are based on byte buffers is the cause of
> many of these.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
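
To make the buffer-sizing point in the comment above concrete, here is a minimal sketch of how the initial buffer size affects the number of expansions (i.e. full data copies). It is not Kafka code: the class BufferSizingSketch, the helper GrowableBuffer, and the methods estimateSize and expansionsFor are hypothetical names made up for illustration, the doubling growth policy is an assumption, and the chunks are written uncompressed so that only the buffer behaviour is modelled.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; none of these names exist in Kafka.
class BufferSizingSketch {

    // Estimate the output size from the input size and an assumed compression ratio.
    static int estimateSize(int uncompressedBytes, double estimatedRatio) {
        return Math.max(1, (int) Math.ceil(uncompressedBytes * estimatedRatio));
    }

    // A growable buffer that counts how often it has to copy its contents.
    static final class GrowableBuffer {
        private ByteBuffer buf;
        private int expansions = 0;

        GrowableBuffer(int initialCapacity) {
            buf = ByteBuffer.allocate(initialCapacity);
        }

        void write(byte[] src) {
            if (buf.remaining() < src.length) {
                // Assumed policy: at least double, and always enough for this write.
                int newCapacity = Math.max(buf.capacity() * 2, buf.position() + src.length);
                ByteBuffer bigger = ByteBuffer.allocate(newCapacity);
                buf.flip();
                bigger.put(buf); // the full data copy we would like to avoid
                buf = bigger;
                expansions++;
            }
            buf.put(src);
        }

        int expansions() {
            return expansions;
        }
    }

    // Write `chunks` chunks of `chunkSize` bytes and report the number of expansions.
    static int expansionsFor(int initialCapacity, int chunks, int chunkSize) {
        GrowableBuffer buffer = new GrowableBuffer(initialCapacity);
        byte[] chunk = new byte[chunkSize];
        for (int i = 0; i < chunks; i++) {
            buffer.write(chunk);
        }
        return buffer.expansions();
    }

    public static void main(String[] args) {
        int chunks = 100;
        int chunkSize = 100; // 10,000 bytes in total

        // Small fixed start, similar in spirit to a default-sized output stream.
        System.out.println("start at 32 bytes: " + expansionsFor(32, chunks, chunkSize));

        // Start from an estimate; 0.5 is a deliberate underestimate here
        // because the sketch writes the data uncompressed.
        int estimate = estimateSize(chunks * chunkSize, 0.5);
        System.out.println("start at estimate: " + expansionsFor(estimate, chunks, chunkSize));
    }
}
```

In this sketch, starting from 32 bytes takes 8 expansions to hold the 10,000 bytes, while starting from an estimate that is off by a factor of two takes only one. That matches the "one or two buffer expansions in the worst case" argument, under the assumption that the estimated ratio is in the right ballpark.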