Hey Becket, Thanks for the KIP. I have one question here.
Suppose the producer's batch.size=100 KB and max.in.flight.requests.per.connection=1. Since each ProduceRequest contains one batch per partition, with the current implementation 100 KB of compressed data is produced per partition per round-trip time. If we disable compression estimation with this KIP, the producer can only send 100 KB of uncompressed data per partition per round-trip time. If the average compression ratio is 10, there will be a 10X difference in the bytes transmitted per round-trip time. The impact on throughput can be big if MirrorMaker is producing to a remote cluster, even though the compression ratio itself stays the same.

Given this observation, we should probably note in the KIP that users should bump up the producer's batch.size to the message.max.bytes configured on the broker, which is roughly 1 MB by default, to achieve the maximum possible throughput when compression estimation is disabled (a rough config sketch of what I mean is appended at the end of this mail, below the quoted thread). Still, this can hurt the throughput of producers or MirrorMakers that produce highly compressible data.

I think we can get around this problem by allowing each request to carry multiple batches per partition, as long as the size of these batches is <= the producer's batch.size config. Do you think it is worth doing?

Thanks,
Dong

On Tue, Feb 21, 2017 at 7:56 PM, Mayuresh Gharat <gharatmayures...@gmail.com> wrote:

> Apurva has a point that can be documented for this config.
>
> Overall, LGTM +1.
>
> Thanks,
>
> Mayuresh
>
> On Tue, Feb 21, 2017 at 6:41 PM, Becket Qin <becket....@gmail.com> wrote:
>
> > Hi Apurva,
> >
> > Yes, it is true that the request size might be much smaller if the
> > batching is based on uncompressed size. I will let the users know about
> > this. That said, in practice, this is probably fine. For example, at
> > LinkedIn, our max message size is 1 MB and the compressed size would
> > typically be 100 KB or larger. Given that in most cases there are many
> > partitions, the request size would not be too small (typically around a
> > few MB).
> >
> > At LinkedIn we do have some topics that have varying compression ratios.
> > Those are usually topics shared by different services, so the data may
> > differ a lot even though the messages are in the same topic and have
> > similar fields.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Tue, Feb 21, 2017 at 6:17 PM, Apurva Mehta <apu...@confluent.io> wrote:
> >
> > > Hi Becket, thanks for the KIP.
> > >
> > > I think one of the risks here is that when compression estimation is
> > > disabled, you could have much smaller batches than expected, and
> > > throughput could be hurt. It would be worth adding this to the
> > > documentation of this setting.
> > >
> > > Also, one of the rejected alternatives states that per-topic estimations
> > > would not work when the compression of individual messages is variable.
> > > This is true in theory, but in practice one would expect Kafka topics to
> > > have fairly homogeneous data, and hence it should compress evenly. I was
> > > curious if you have data which shows otherwise.
> > >
> > > Thanks,
> > > Apurva
> > >
> > > On Tue, Feb 21, 2017 at 12:30 PM, Becket Qin <becket....@gmail.com> wrote:
> > >
> > > > Hi folks,
> > > >
> > > > I would like to start the discussion thread on KIP-126. The KIP
> > > > proposes adding a new configuration to KafkaProducer to allow
> > > > batching based on uncompressed message size.
> > > >
> > > > Comments are welcome.
> > > >
> > > > The KIP wiki is the following:
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-126+-+Allow+KafkaProducer+to+batch+based+on+uncompressed+size
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > >
> >
>
> --
> -Regards,
> Mayuresh R. Gharat
> (862) 250-7125
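
P.S. For concreteness, below is a minimal, hypothetical sketch of the kind of producer configuration I have in mind once compression estimation is disabled. The property name "enable.uncompressed.batching" is only a placeholder for whatever key KIP-126 finally settles on, and the numbers just mirror the 100 KB / 10:1 example above; this is not the final implementation.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public class UncompressedBatchingSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
            "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
            "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip");
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);

        // Placeholder name for the proposed KIP-126 switch; the actual
        // config key may be different.
        props.put("enable.uncompressed.batching", true);

        // With batching done on uncompressed size, keeping batch.size at
        // 100 KB would put only ~10 KB per partition on the wire for 10:1
        // compressible data. Raising batch.size toward the broker's
        // message.max.bytes (~1 MB by default) restores roughly the old
        // ~100 KB compressed batch per partition per round trip.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 1_000_000);

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            // ... send records as usual ...
        }
    }
}

Even with batch.size raised like this, data that compresses better than 10:1 is still capped at ~1 MB uncompressed per batch, which is why I think the multiple-batches-per-partition idea above is worth considering.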