Hey Becket,

I get the problem we want to solve with this, but I don't think this
makes sense as a user-controlled knob that everyone sending data to
Kafka has to think about. It is basically a bug, right?

First, as a technical question: is it true that using the uncompressed size
for batching actually guarantees that you observe the limit? I think that
implies that compression always makes the messages smaller, which I think is
usually true but is not guaranteed, right? E.g. if someone encrypts their
data, which tends to randomize it, and then enables compression, it could
get slightly bigger?
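
A quick way to check this is sketched below. It is only an illustration
under assumptions: pseudo-random bytes stand in for an encrypted payload,
and Python's zlib stands in for whatever codec the producer is configured
with.

```python
import random
import zlib

# Pseudo-random bytes stand in for an encrypted payload (illustrative assumption).
rng = random.Random(42)
payload = bytes(rng.getrandbits(8) for _ in range(4096))

compressed = zlib.compress(payload)

# Incompressible input still pays for the codec's framing, so it grows a little.
growth = len(compressed) - len(payload)
print(f"raw={len(payload)} compressed={len(compressed)} growth={growth} bytes")
```

The growth is only a handful of bytes, but it is positive, so a batch
sized exactly at the limit using uncompressed bytes could still end up
over the limit after compression.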

I also wonder if the rejected alternatives you describe couldn't be made to
work: basically try to be a bit better at estimation and recover when we
guess wrong. I don't think the memory usage should be a problem: isn't it
the same memory usage the consumer of that topic would need? And can't you
do the splitting and recompression in a streaming fashion? If we can make
the estimation miss rate low and the recovery cost is just ~2x the normal
cost for that batch, that should be totally fine, right? (It's technically
true you might have to split more than once, but since you halve it each
time I think you should get a number of halvings that is logarithmic in the
miss size, which, with better estimation, you'd hope would be super duper
small.)
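
The halving idea can be sketched roughly as follows. This is not the
producer's actual code: `split_to_fit` and `compressed_size` are
hypothetical helpers, zlib is a stand-in codec, and for simplicity each
piece is recompressed from scratch rather than in a streaming fashion.

```python
import random
import zlib


def compressed_size(records):
    """Size in bytes of the records compressed together as one batch."""
    return len(zlib.compress(b"".join(records)))


def split_to_fit(records, max_bytes, depth=0):
    """Halve a batch until each piece compresses to at most max_bytes.

    Returns (batches, deepest_split). A single record that is still too
    large is returned as-is; a real producer would have to reject it.
    """
    if len(records) <= 1 or compressed_size(records) <= max_bytes:
        return [records], depth
    mid = len(records) // 2
    left, left_depth = split_to_fit(records[:mid], max_bytes, depth + 1)
    right, right_depth = split_to_fit(records[mid:], max_bytes, depth + 1)
    return left + right, max(left_depth, right_depth)


# Worst case for the estimator: incompressible 100-byte records.
rng = random.Random(7)
records = [bytes(rng.getrandbits(8) for _ in range(100)) for _ in range(64)]
batches, depth = split_to_fit(records, max_bytes=2000)
print(f"{len(batches)} batches after {depth} rounds of halving")
```

Here the batch is deliberately oversized by more than 3x to force several
splits. With a good estimator the miss should be a few percent at most, so
a single halving would normally be enough.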

Alternatively maybe we could work on the other side of the problem and try
to make it so that a small miss on message size isn't a big problem. I
think the original issue was that max size and fetch size were tightly
coupled, and the way memory in the consumer worked you really wanted fetch
size to be as small as possible: you'd use that much memory per fetched
partition, and the consumer would get stuck if its fetch size wasn't big
enough. I think we made some progress on that issue and maybe more could be
done there so that a small bit of fuzziness around the size would not be an
issue?
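
As a toy model of that coupling (not actual consumer code; the function
names and the numbers are made up for illustration), the two pressures
look like this:

```python
def consumer_memory_bytes(partitions, fetch_size):
    """Upper bound on fetch buffers: one fetch_size buffer per partition."""
    return partitions * fetch_size


def can_make_progress(next_batch_bytes, fetch_size):
    """Under the old coupling, a batch larger than fetch_size could not be read."""
    return next_batch_bytes <= fetch_size


fetch_size = 1_048_576        # 1 MiB, illustrative value
batch = fetch_size + 200      # producer overshot the limit by 200 bytes

print(consumer_memory_bytes(partitions=500, fetch_size=fetch_size))
print(can_make_progress(batch, fetch_size))
```

Raising the fetch size to absorb the overshoot multiplies across every
fetched partition, which is why a strict limit mattered. If the consumer
can tolerate a batch slightly larger than its fetch size, the 200-byte
miss stops being fatal.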

-Jay



On Tue, Feb 21, 2017 at 12:30 PM, Becket Qin <becket....@gmail.com> wrote:

> Hi folks,
>
> I would like to start the discussion thread on KIP-126. The KIP proposes
> adding a new configuration to KafkaProducer to allow batching based on
> uncompressed message size.
>
> Comments are welcome.
>
> The KIP wiki is as follows:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-126+-+Allow+KafkaProducer+to+batch+based+on+uncompressed+size
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
