Hi Michael, Yes, it's a good idea and I considered it, the main problem is the *FileRecords* class does not accept number of batches as a parameter, it accepts bytes instead, so if we want to do so either we redesign a core class or we create a new one. One of the pretty things (I consider) about this change which will bring a huge benefit is the change in the code is pretty small and it's on the final script, it does not require any deep change in a core library.
An alternative which requires a big change as well without touching the FileRecords class would be accept number of batches as parameter, then call the FileRecords.slice() with a small value (bytes) count the batches, see if we can satisfy the number of batches, if not then we call it again and again until we reach the amount of batches. This will require a lot of code as well So long story short, the proposal change is quite small, it uses the current classes and has a big benefit. Maybe in the future we could consider the FileRecords class to support getting the amount of batches as parameters and we encapsulate this logic in the proper class (FileRecords) What do you think? Thanks Sergio Thanks Sergio On Tue, 8 Mar 2022 at 18:32, Mickael Maison <mickael.mai...@gmail.com> wrote: > Hi Sergio, > > Thanks for the KIP. Instead of specifying the size in bytes, have you > considered specifying it in terms of number of batches? I think it's a > bit more user friendly than a size in raw bytes. > For example: --num-batches: The number of batches to read from the log > segment. > > Thanks, > Mickael > > > On Tue, Mar 8, 2022 at 5:27 AM Sergio Daniel Troiano > <sergio.troi...@adevinta.com.invalid> wrote: > > > > Hi Luke, > > > > Make sense, done! > > > > Thank you. > > Sergio Troiano > > > > On Tue, 8 Mar 2022 at 03:02, Luke Chen <show...@gmail.com> wrote: > > > > > Hi Sergio, > > > > > > > I don't want this to minimize the main feature I want to deploy as I > > > think the > > > message size limit is not as important as the limiting the amount of > > > batches. > > > > > > Agree! Let's focus on the feature of limiting the batch amounts. > > > > > > One more comment to the KIP: > > > 1. Could you put the new parameter description into the KIP proposed > change > > > section? That would make it much clear. > > > > > > > > > Thank you. > > > Luke > > > > > > On Mon, Mar 7, 2022 at 8:44 PM Sergio Daniel Troiano > > > <sergio.troi...@adevinta.com.invalid> wrote: > > > > > > > hey Luke, > > > > > > > > I am interested in expanding the KIP scope but I am a bit concerned > this > > > > could create a lot of noise and confusion as they look like very > similar > > > > parameters, I agree this is a small change, so I think if I do it > > > properly > > > > it should not be a problem at all, I just will need a couple more of > days > > > > as I want to create the proper tests as well. > > > > > > > > I have a doubt about editing the KIP, I mean should I add this as a > new > > > > feature as well?, should I describe this as a side effect finding? I > > > don't > > > > want this to minimize the main feature I want to deploy as I think > the > > > > message size limit is not as important as the limiting the amount of > > > > batches. > > > > > > > > It is up to you, if you guys consider we must add this in this KIP > then I > > > > will be happy to do it. 😀 > > > > > > > > Best regards. > > > > Sergio Troiano > > > > > > > > On Mon, 7 Mar 2022 at 02:01, Luke Chen <show...@gmail.com> wrote: > > > > > > > > > Hi Sergio, > > > > > > > > > > Thanks for your explanation. > > > > > Make sense to me. > > > > > > > > > > > Only interesting thing that I have just found is > *max-message-size > > > *is > > > > > not > > > > > used while dump logs are requested, instead it is used by dumpIndex > > > > > > > > > > Are you interested in expanding the scope of this KIP to include > the > > > > > *max-message-size* in dumping logs? > > > > > I think it's good because it will also be a small change, and no > need > > > to > > > > go > > > > > through another KIP discussing/voting process. But I'm fine if you > want > > > > to > > > > > keep this KIP as is, and create another JIRA ticket for future > work. > > > > > > > > > > Thank you. > > > > > Luke > > > > > > > > > > On Mon, Mar 7, 2022 at 6:02 AM Sergio Daniel Troiano > > > > > <sergio.troi...@adevinta.com.invalid> wrote: > > > > > > > > > > > hey Luke, > > > > > > > > > > > > Let me answer them: > > > > > > 1. If the *max-batches-size* is too small that results in no > records > > > > > > output, will we output any information to the user? > > > > > > > > > > > > If the *max-batches-size*is even smaller than the first batch > then > > > > there > > > > > > won't be any output, this is handled by FileRecords class, I > think > > > this > > > > > is > > > > > > correct as this is the expected behaviour. > > > > > > > > > > > > 2. After your explanation, I guess the use of *max-batches-size* > > > won't > > > > > > conflict with *max-message-size*, right? > > > > > > > > > > > > Only interesting thing that I have just found is > *max-message-size > > > *is > > > > > not > > > > > > used while dump logs are requested, instead it is used by > dumpIndex > > > > > > < > > > > > > > > > > > > > > > > > > > https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/tools/DumpLogSegments.scala#L150 > > > > > > > > > > > > > so, > > > > > > this feature is not working for dumping logs, even though I > checked > > > if > > > > > > there is a unit test for this and there is not any. Maybe we > could > > > > > create a > > > > > > ticket for this? > > > > > > > > > > > > Regards. > > > > > > > > > > > > > > > > > > On Sat, 5 Mar 2022 at 10:36, Luke Chen <show...@gmail.com> > wrote: > > > > > > > > > > > > > Hi Sergio, > > > > > > > > > > > > > > Thanks for the explanation! Very clear! > > > > > > > I think we should put this example and explanation into KIP. > > > > > > > > > > > > > > Other comments: > > > > > > > 1. If the *max-batches-size* is too small that results in no > > > records > > > > > > > output, will we output any information to the user? > > > > > > > 2. After your explanation, I guess the use of > *max-batches-size* > > > > won't > > > > > > > conflict with *max-message-size*, right? > > > > > > > That is, user can set the 2 arguments at the same time. Is that > > > > > correct? > > > > > > > > > > > > > > Thank you. > > > > > > > Luke > > > > > > > > > > > > > > Thank you. > > > > > > > Luke > > > > > > > > > > > > > > On Sat, Mar 5, 2022 at 4:47 PM Sergio Daniel Troiano > > > > > > > <sergio.troi...@adevinta.com.invalid> wrote: > > > > > > > > > > > > > > > hey Luke, > > > > > > > > > > > > > > > > thanks for the interest, it is a good question, please let me > > > > explain > > > > > > > you: > > > > > > > > > > > > > > > > *max-message-size *a filter for the size of each batch, so > for > > > > > example > > > > > > if > > > > > > > > Iset --max-message-size 1000 bytes and my segment log has 300 > > > > > batches, > > > > > > > 150 > > > > > > > > of them has a size of 500 bytes and the other 150 has a > size of > > > > 2000 > > > > > > > bytes > > > > > > > > then the script will skip the las 150 ones as each batch is > > > heavier > > > > > > than > > > > > > > > the limit. > > > > > > > > > > > > > > > > In the other hand following the same example above with > > > > > > *max-batches-size > > > > > > > > *set > > > > > > > > to 1000 bytes it will only print out the first 2 batches (500 > > > bytes > > > > > > each) > > > > > > > > and stop, This will avoid reading the whole file > > > > > > > > > > > > > > > > > > > > > > > > Also if all of them are smaller than 1000 bytes it will end > up > > > > > printing > > > > > > > out > > > > > > > > all the batches. > > > > > > > > The idea of my change is to limit the *amount* of batches no > > > matter > > > > > > their > > > > > > > > size. > > > > > > > > > > > > > > > > I hope this reply helps. > > > > > > > > Best regards. > > > > > > > > > > > > > > > > On Sat, 5 Mar 2022 at 08:00, Luke Chen <show...@gmail.com> > > > wrote: > > > > > > > > > > > > > > > > > Hi Sergio, > > > > > > > > > > > > > > > > > > Thanks for the KIP! > > > > > > > > > > > > > > > > > > One question: > > > > > > > > > I saw there's a `max-message-size` argument that seems to > do > > > the > > > > > same > > > > > > > > thing > > > > > > > > > as you want. > > > > > > > > > Could you help explain what's the difference between > > > > > > `max-message-size` > > > > > > > > and > > > > > > > > > `max-batches-size`? > > > > > > > > > > > > > > > > > > Thank you. > > > > > > > > > Luke > > > > > > > > > > > > > > > > > > On Sat, Mar 5, 2022 at 3:21 AM Kirk True < > > > k...@mustardgrain.com> > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > Hi Sergio, > > > > > > > > > > > > > > > > > > > > Thanks for the KIP. I don't know anything about the log > > > segment > > > > > > > > > internals, > > > > > > > > > > but the logic and implementation seem sound. > > > > > > > > > > > > > > > > > > > > Three questions: > > > > > > > > > > 1. Since the --max-batches-size unit is bytes, does it > > > matter > > > > if > > > > > > > that > > > > > > > > > > size doesn't align to a record boundary? > > > > > > > > > > 2. Can you add a check to make sure that > --max-batches-size > > > > > > doesn't > > > > > > > > > allow > > > > > > > > > > the user to pass in a negative number? > > > > > > > > > > 3. Can you add/update any unit tests related to the > > > > > > DumpLogSegments > > > > > > > > > > arguments? > > > > > > > > > > Thanks, > > > > > > > > > > Kirk > > > > > > > > > > > > > > > > > > > > On Thu, Mar 3, 2022, at 1:32 PM, Sergio Daniel Troiano > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-824%3A+Allowing+dumping+segmentlogs+limiting+the+batches+in+the+output > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >