Hey Becket,

I get the problem we want to solve with this, but I don't think this is something that makes sense as a user-controlled knob that everyone sending data to Kafka has to think about. It is basically a bug, right?
First, as a technical question: is it true that using the uncompressed size for batching actually guarantees that you observe the limit? That seems to imply that compression always makes the messages smaller, which I think is usually true but isn't guaranteed, right? E.g. if someone encrypts their data, which tends to randomize it, and then enables compression, it could get slightly bigger?

I also wonder whether the rejected alternatives you describe couldn't be made to work: basically, try to be a bit better at estimation and recover when we guess wrong. I don't think the memory usage should be a problem: isn't it the same memory the consumer of that topic would need? And can't you do the splitting and recompression in a streaming fashion? If we can make the estimation miss rate low and the recovery cost is just ~2x the normal cost for that batch, that should be totally fine, right? (It's technically true you might have to split more than once, but since you halve the batch each time, the number of halvings should be logarithmic in the size of the miss, which with better estimation you'd hope would be super duper small. A rough sketch of this split-on-miss loop is below, after the quoted message.)

Alternatively, maybe we could work on the other side of the problem and try to make it so that a small miss on message size isn't a big problem. I think the original issue was that the max message size and the fetch size were tightly coupled, and the way memory in the consumer worked you really wanted the fetch size to be as small as possible, because you'd use that much memory per fetched partition and the consumer would get stuck if its fetch size wasn't big enough. I think we made some progress on that issue, and maybe more could be done there so that a small bit of fuzziness around the size would not be an issue?

-Jay

On Tue, Feb 21, 2017 at 12:30 PM, Becket Qin <becket....@gmail.com> wrote:

> Hi folks,
>
> I would like to start the discussion thread on KIP-126. The KIP proposes
> adding a new configuration to KafkaProducer to allow batching based on
> uncompressed message size.
>
> Comments are welcome.
>
> The KIP wiki is the following:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 126+-+Allow+KafkaProducer+to+batch+based+on+uncompressed+size
>
> Thanks,
>
> Jiangjie (Becket) Qin
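
[To make the "estimate and recover" idea above concrete, here is a minimal, hypothetical sketch of a split-and-recompress loop: if the compressed batch overshoots the limit, halve it and retry each half. The `compressedSize` helper, the `fit` method, and the `maxBatchBytes` parameter are illustrative stand-ins, not existing KafkaProducer APIs; GZIP is used only to have a concrete codec.]

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

public class BatchSplitSketch {

    // Illustrative stand-in: compress the candidate batch and measure the result.
    // A real producer would use its configured codec rather than GZIP.
    static int compressedSize(List<byte[]> batch) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                for (byte[] record : batch) {
                    gz.write(record);
                }
            }
            return bos.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // If the compressed batch overshoots the limit, split it in half and recurse.
    // Each miss halves the batch, so the number of extra compression passes is
    // logarithmic in the size of the miss.
    static List<List<byte[]>> fit(List<byte[]> batch, int maxBatchBytes) {
        if (batch.size() <= 1 || compressedSize(batch) <= maxBatchBytes) {
            return List.of(batch);
        }
        int mid = batch.size() / 2;
        List<List<byte[]>> result = new ArrayList<>();
        result.addAll(fit(batch.subList(0, mid), maxBatchBytes));
        result.addAll(fit(batch.subList(mid, batch.size()), maxBatchBytes));
        return result;
    }
}
```

[With a decent estimator the common case never enters the recursion at all, so the cost of a miss is roughly one extra compression pass (~2x) for that batch, which is the point made in the thread above.]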