Hi devs,

Bumping this thread with a call for votes on KIP-782: Expandable batch size in producer.
The main goals of this KIP are:
1. higher throughput in the producer
2. better memory usage in the producer

The detailed description can be found here:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-782%3A+Expandable+batch+size+in+producer

Any feedback and comments are welcome.

Thank you.
Luke

On Fri, Nov 5, 2021 at 4:37 PM Luke Chen <show...@gmail.com> wrote:

> Hi Mickael,
> Thanks for the good comments! Answering them below:
>
> - When under load, the producer may allocate extra buffers. Are these
> buffers ever released if the load drops?
> --> This is a good point that I had not considered before. Yes, after
> introducing "batch.max.size", we should release some of the buffers
> instead of keeping them all in the buffer pool. In this KIP, we'll keep
> at most "batch.size" bytes in the pool and mark the rest of the memory
> as free to use. The reason we return at most "batch.size" bytes to the
> pool is that the semantics of "batch.size" is the batch-full limit: in
> most cases, a "batch.size" buffer should be able to hold the records
> to be sent within the linger.ms time.
>
> - Do we really need batch.initial.size? It's not clear that having this
> extra setting adds a lot of value.
> --> I think "batch.initial.size" is important for achieving better
> memory usage. I've now made the default value 4KB, so after upgrading
> to the new release, the producer's memory usage will improve.
>
> I've updated the KIP.
>
> Thank you.
> Luke
>
> On Wed, Nov 3, 2021 at 6:44 PM Mickael Maison <mickael.mai...@gmail.com>
> wrote:
>
>> Hi Luke,
>>
>> Thanks for the KIP. It looks like an interesting idea. I like the
>> concept of dynamically adjusting settings to handle load. I wonder if
>> other client settings could also benefit from similar logic.
>>
>> Just a couple of questions:
>> - When under load, the producer may allocate extra buffers. Are these
>> buffers ever released if the load drops?
>> - Do we really need batch.initial.size? It's not clear that having
>> this extra setting adds a lot of value.
>>
>> Thanks,
>> Mickael
>>
>> On Tue, Oct 26, 2021 at 11:12 AM Luke Chen <show...@gmail.com> wrote:
>> >
>> > Thank you, Artem!
>> >
>> > @devs, welcome to vote for this KIP.
>> > Key proposal:
>> > 1. allocate multiple smaller buffers of the initial batch size in the
>> > producer and chain them together in a list when expanding, for better
>> > memory usage
>> > 2. add a max batch size config to the producer, so that when the
>> > produce rate is suddenly high, we can still get high throughput with a
>> > batch size larger than "batch.size" (and less than "batch.max.size",
>> > where "batch.size" is the soft limit and "batch.max.size" is the hard
>> > limit)
>> > Here's the updated KIP:
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-782%3A+Expandable+batch+size+in+producer
>> >
>> > Any comments and feedback are welcome.
>> >
>> > Thank you.
>> > Luke
>> >
>> > On Tue, Oct 26, 2021 at 6:35 AM Artem Livshits
>> > <alivsh...@confluent.io.invalid> wrote:
>> >
>> > > Hi Luke,
>> > >
>> > > I've looked at the updated KIP-782, it looks good to me.
>> > >
>> > > -Artem
>> > >
>> > > On Sun, Oct 24, 2021 at 1:46 AM Luke Chen <show...@gmail.com> wrote:
>> > >
>> > > > Hi Artem,
>> > > > Thanks for your good suggestion again.
>> > > > I've incorporated your idea into this KIP and updated it.
>> > > > Note that, in the end, I still keep the "batch.initial.size" config
>> > > > (default is 0, which means "batch.size" will be used as the initial
>> > > > batch size) for better memory conservation.
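>> > > >
>> > > > To make the settings concrete, a producer configuration under this
>> > > > proposal could look roughly like the sketch below. Note that
>> > > > "batch.initial.size" and "batch.max.size" are the config names
>> > > > proposed in this KIP, not released configs, and the values are only
>> > > > examples:
>> > > >
>> > > > import java.util.Properties;
>> > > > import org.apache.kafka.clients.producer.KafkaProducer;
>> > > >
>> > > > Properties props = new Properties();
>> > > > props.put("bootstrap.servers", "localhost:9092");
>> > > > props.put("key.serializer",
>> > > >     "org.apache.kafka.common.serialization.StringSerializer");
>> > > > props.put("value.serializer",
>> > > >     "org.apache.kafka.common.serialization.StringSerializer");
>> > > > // Existing config: the "ready to send" (soft) limit.
>> > > > props.put("batch.size", 16384);
>> > > > // Proposed in KIP-782: initial buffer allocation per batch;
>> > > > // 0 means "start at batch.size".
>> > > > props.put("batch.initial.size", 4096);
>> > > > // Proposed in KIP-782: hard limit before a new batch is created.
>> > > > props.put("batch.max.size", 262144);
>> > > > KafkaProducer<String, String> producer = new KafkaProducer<>(props);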
>> > > >
>> > > > The detailed description can be found here:
>> > > >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-782%3A+Expandable+batch+size+in+producer
>> > > >
>> > > > Let me know if you have other suggestions.
>> > > >
>> > > > Thank you.
>> > > > Luke
>> > > >
>> > > > On Sat, Oct 23, 2021 at 10:50 AM Luke Chen <show...@gmail.com>
>> wrote:
>> > > >
>> > > >> Hi Artem,
>> > > >> Thanks for the suggestion. Let me confirm my understanding is
>> > > >> correct. What you suggest is that "batch.size" becomes more of a
>> > > >> "soft limit" on the batch size, and the "hard limit" is
>> > > >> "batch.max.size". When the buffer reaches "batch.size", the batch
>> > > >> is "ready" to be sent. But before linger.ms is reached, if more
>> > > >> data comes in, we can still accumulate it into the same buffer,
>> > > >> until it reaches "batch.max.size". After it reaches
>> > > >> "batch.max.size", we'll create another batch.
>> > > >>
>> > > >> So with your suggestion, we won't need "batch.initial.size"; we
>> > > >> can use "batch.size" as the initial batch size and chain
>> > > >> "batch.size" buffers together until the batch reaches
>> > > >> "batch.max.size". Something like this:
>> > > >>
>> > > >> [image: image.png]
>> > > >> Is my understanding correct?
>> > > >> If so, that sounds good to me.
>> > > >> If not, please kindly explain more to me.
>> > > >>
>> > > >> Thank you.
>> > > >> Luke
>> > > >>
>> > > >> On Sat, Oct 23, 2021 at 2:13 AM Artem Livshits
>> > > >> <alivsh...@confluent.io.invalid> wrote:
>> > > >>
>> > > >>> Hi Luke,
>> > > >>>
>> > > >>> Nice suggestion. It should optimize how memory is used with
>> > > >>> different production rates, but I wonder if we can take this idea
>> > > >>> further and improve batching in general.
>> > > >>>
>> > > >>> Currently batch.size is used in two conditions:
>> > > >>>
>> > > >>> 1. When we append records to a batch in the accumulator, we
>> > > >>> create a new batch if the current batch would exceed batch.size.
>> > > >>> 2. When we drain a batch from the accumulator, the batch becomes
>> > > >>> 'ready' when it reaches batch.size.
>> > > >>>
>> > > >>> The second condition works well with the current batch size: if
>> > > >>> linger.ms is greater than 0, the send can be triggered by
>> > > >>> reaching the batching goal rather than waiting out the full
>> > > >>> linger time.
>> > > >>>
>> > > >>> The first condition, though, leads to creating many batches if
>> > > >>> the network latency or production rate (or both) is high. With 5
>> > > >>> in-flight requests and 16KB batches, we can only have 80KB of
>> > > >>> data in flight per partition, which means that with 50ms latency
>> > > >>> we can only push ~1.6MB/sec per partition (this goes down with
>> > > >>> higher latencies, e.g. with 100ms we can only push ~0.8MB/sec).
>> > > >>>
>> > > >>> I think it would be great to separate the two sizes:
>> > > >>>
>> > > >>> 1. When appending records to a batch, create a new batch only if
>> > > >>> the current batch would exceed a larger size (we can call it
>> > > >>> batch.max.size), say 256KB by default.
>> > > >>> 2. When we drain, consider a batch 'ready' if it exceeds
>> > > >>> batch.size, which is 16KB by default.
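>> > > >>>
>> > > >>> In rough Java terms, the separation would look like the sketch
>> > > >>> below (an illustrative example, not the actual RecordAccumulator
>> > > >>> code; the class and method names are made up):
>> > > >>>
>> > > >>> class BatchingSketch {
>> > > >>>     static final int BATCH_SIZE = 16 * 1024;      // soft limit: drain readiness
>> > > >>>     static final int BATCH_MAX_SIZE = 256 * 1024; // hard limit: append cutoff
>> > > >>>
>> > > >>>     // Append path: only start a new batch once the hard limit
>> > > >>>     // would be exceeded.
>> > > >>>     static boolean needNewBatch(int batchBytes, int recordBytes) {
>> > > >>>         return batchBytes + recordBytes > BATCH_MAX_SIZE;
>> > > >>>     }
>> > > >>>
>> > > >>>     // Drain path: a batch is 'ready' once it reaches the soft
>> > > >>>     // limit, or once linger.ms has expired.
>> > > >>>     static boolean ready(int batchBytes, boolean lingerExpired) {
>> > > >>>         return batchBytes >= BATCH_SIZE || lingerExpired;
>> > > >>>     }
>> > > >>> }
>> > > >>>
>> > > >>> The in-flight bound then becomes 5 * 256KB = 1280KB per partition
>> > > >>> instead of 5 * 16KB = 80KB, i.e. up to ~25.6MB/sec instead of
>> > > >>> ~1.6MB/sec at 50ms latency (an upper bound, of course, ignoring
>> > > >>> all other bottlenecks).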
>> > > >>>
>> > > >>> For memory conservation we may introduce batch.initial.size if we
>> > > >>> want the flexibility to make it even smaller than batch.size, or
>> > > >>> we can just always use batch.size as the initial size (in which
>> > > >>> case we don't need a batch.initial.size config).
>> > > >>>
>> > > >>> -Artem
>> > > >>>
>> > > >>> On Fri, Oct 22, 2021 at 1:52 AM Luke Chen <show...@gmail.com>
>> wrote:
>> > > >>>
>> > > >>> > Hi Kafka dev,
>> > > >>> > I'd like to start a vote on the proposal: KIP-782: Expandable
>> > > >>> > batch size in producer.
>> > > >>> >
>> > > >>> > The main purpose of this KIP is to achieve better memory usage
>> > > >>> > in the producer, and also to save users from the dilemma of
>> > > >>> > setting the batch size configuration. After this KIP, users can
>> > > >>> > set a higher batch.size without worries, and of course, with an
>> > > >>> > appropriate "batch.initial.size".
>> > > >>> >
>> > > >>> > The detailed description can be found here:
>> > > >>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-782%3A+Expandable+batch+size+in+producer
>> > > >>> >
>> > > >>> > Any comments and feedback are welcome.
>> > > >>> >
>> > > >>> > Thank you.
>> > > >>> > Luke