Re: [VOTE] KIP-1001; CurrentControllerId Metric

2023-11-21 Thread David Jacot
+1 from me.

Thanks,
David

On Mon, Nov 20, 2023 at 10:48 PM Jason Gustafson 
wrote:

> The KIP makes sense. +1
>
> On Mon, Nov 20, 2023 at 12:37 PM David Arthur
>  wrote:
>
> > Thanks Colin,
> >
> > +1 from me
> >
> > -David
> >
> > On Tue, Nov 14, 2023 at 3:53 PM Colin McCabe  wrote:
> >
> > > Hi all,
> > >
> > > I'd like to call a vote for KIP-1001: Add CurrentControllerId metric.
> > >
> > > Take a look here:
> > > https://cwiki.apache.org/confluence/x/egyZE
> > >
> > > best,
> > > Colin
> > >
> >
> >
> > --
> > -David
> >
>


[jira] [Resolved] (KAFKA-15836) KafkaConsumer subscribes to multiple topics does not respect max.poll.records

2023-11-21 Thread Philip Nee (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Nee resolved KAFKA-15836.

Resolution: Fixed

PR merged.

> KafkaConsumer subscribes to multiple topics does not respect max.poll.records
> -
>
> Key: KAFKA-15836
> URL: https://issues.apache.org/jira/browse/KAFKA-15836
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Philip Nee
>Assignee: Andrew Schofield
>Priority: Blocker
>  Labels: consumer
> Fix For: 3.7.0
>
>
> We discovered that when KafkaConsumer subscribes to multiple topics with 
> max.poll.records configured, the limit is not properly respected 
> for every poll() invocation.
>  
> I was able to reproduce it with the AK example, here is how I ran my tests:
> [https://github.com/apache/kafka/pull/14772]
>  
> 1. start zookeeper and kafka server (or kraft mode should be fine too)
> 2. Run: examples/bin/java-producer-consumer-demo.sh 1000
> 3. Polled records > 400 will be printed to stdout
>  
> Here is what the program does:
> The producer produces a large number of records to multiple topics.  We 
> configure the consumer with max.poll.records = 400 and subscribe to 
> multiple topics.  The consumer polls, and the returned batch can sometimes 
> contain more than 400 records.
>  
> This is an issue in AK 3.6 but 3.5 was fine.
>  
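For reference, a minimal consumer sketch of the expectation at stake here (a sketch only; the bootstrap server, group id, and topic names are placeholders, not taken from the reproduction steps above): with max.poll.records = 400, no single poll() should return more than 400 records.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class MaxPollRecordsCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "max-poll-check");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        // The limit under test: a single poll() should never return more than 400 records.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 400);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("topic1", "topic2"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (records.count() > 400) {
                    // This is the violation reported in this ticket.
                    System.out.println("poll() returned " + records.count() + " records");
                }
            }
        }
    }
}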



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] KIP-997: Partition-Level Throughput Metrics

2023-11-21 Thread Qichao Chu
Hi All,

It would be nice if we could have more people to review and vote for this
KIP.
Many thanks!

Qichao


On Mon, Nov 20, 2023 at 2:43 PM Qichao Chu  wrote:

> @Matthias: yeah it should be 977, sorry for the confusion.
> Btw, do you want to cast another binding vote for it?
>
> Best,
> Qichao Chu
>
>
> On Fri, Nov 17, 2023 at 12:45 AM Matthias J. Sax  wrote:
>
>> This is KIP-977, right? Not as the subject says.
>>
>> Guess we won't be able to fix this now. Hope it does not cause confusion
>> down the line...
>>
>>
>> -Matthias
>>
>> On 11/16/23 4:43 AM, Kamal Chandraprakash wrote:
>> > +1 (non-binding). Thanks for the KIP!
>> >
>> > On Thu, Nov 16, 2023 at 9:00 AM Satish Duggana <
>> satish.dugg...@gmail.com>
>> > wrote:
>> >
>> >> Thanks Qichao for the KIP.
>> >>
>> >> +1 (binding)
>> >>
>> >> ~Satish.
>> >>
>> >> On Thu, 16 Nov 2023 at 02:20, Jorge Esteban Quilcate Otoya
>> >>  wrote:
>> >>>
>> >>> Qichao, thanks again for leading this proposal!
>> >>>
>> >>> +1 (non-binding)
>> >>>
>> >>> Cheers,
>> >>> Jorge.
>> >>>
>> >>> On Wed, 15 Nov 2023 at 19:17, Divij Vaidya 
>> >> wrote:
>> >>>
>>  +1 (binding)
>> 
>>  I was involved in the discussion thread for this KIP and support it
>> in
>> >> its
>>  current form.
>> 
>>  --
>>  Divij Vaidya
>> 
>> 
>> 
>>  On Wed, Nov 15, 2023 at 10:55 AM Qichao Chu > >
>>  wrote:
>> 
>> > Hi all,
>> >
>> > I'd like to call a vote for KIP-977: Partition-Level Throughput
>> >> Metrics.
>> >
>> > Please take a look here:
>> >
>> >
>> 
>> >>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-977%3A+Partition-Level+Throughput+Metrics
>> >
>> > Best,
>> > Qichao Chu
>> >
>> 
>> >>
>> >
>>
>


Jenkins build is unstable: Kafka » Kafka Branch Builder » trunk #2402

2023-11-21 Thread Apache Jenkins Server
See 




[VOTE] 3.5.2 RC1

2023-11-21 Thread Luke Chen
Hello Kafka users, developers and client-developers,

This is the first candidate for release of Apache Kafka 3.5.2.

This is a bugfix release with several fixes since the release of 3.5.1,
including dependency version bumps for CVEs.

Release notes for the 3.5.2 release:
https://home.apache.org/~showuon/kafka-3.5.2-rc1/RELEASE_NOTES.html

*** Please download, test and vote by Nov. 28.

Kafka's KEYS file containing PGP keys we use to sign the release:
https://kafka.apache.org/KEYS

* Release artifacts to be voted upon (source and binary):
https://home.apache.org/~showuon/kafka-3.5.2-rc1/

* Maven artifacts to be voted upon:
https://repository.apache.org/content/groups/staging/org/apache/kafka/

* Javadoc:
https://home.apache.org/~showuon/kafka-3.5.2-rc1/javadoc/

* Tag to be voted upon (off 3.5 branch) is the 3.5.2 tag:
https://github.com/apache/kafka/releases/tag/3.5.2-rc1

* Documentation:
https://kafka.apache.org/35/documentation.html

* Protocol:
https://kafka.apache.org/35/protocol.html

* Successful Jenkins builds for the 3.5 branch:
Unit/integration tests:
https://ci-builds.apache.org/job/Kafka/job/kafka/job/3.5/98/
There are some flaky tests, including the testSingleIP test failure. It
failed because of an infra change and we fixed it recently.

System tests: running, will update the results later.



Thank you.
Luke


Re: [VOTE] KIP-968: Support single-key_multi-timestamp interactive queries (IQv2) for versioned state stores

2023-11-21 Thread Alieh Saeedi
Thanks, Matthias; I changed it to `ANY` which is the shortest and not
misleading.

Cheers,
Alieh

On Mon, Nov 20, 2023 at 7:42 PM Matthias J. Sax  wrote:

> Adding an enum is a good idea!
>
> Wondering if `UNORDERED` is the best name? Want to avoid bike shedding,
> just asking.
>
> We could also use `UNDEFINED` / `UNSPECIFIED` / `NONE` / `ANY` ?
>
> In the end, the result _might_ be ordered, we just don't guarantee any
> order.
>
>
> -Matthias
>
> On 11/20/23 9:17 AM, Alieh Saeedi wrote:
> > Hi all,
> > I added the public enum `ResultOrder` to the KIP which helps with keeping
> > three values (unordered, ascending, and descending) for the query
> results.
> > Therefore the method `isAscending()` is changed to `resultOrder()` which
> > > returns either the user-specified result order or `unordered`.
> > Cheers,
> > Alieh
> >
> > On Mon, Nov 20, 2023 at 1:40 PM Alieh Saeedi 
> wrote:
> >
> >> Thank you, Guozhag and Bruno, for reviewing the KIP and reading the
> whole
> >> discussion thread. I appreciate your help:)
> >> The KIP is now corrected and updated.
> >>
> >> Cheers,
> >> Alieh
> >>
> >> On Mon, Nov 20, 2023 at 10:43 AM Bruno Cadonna 
> wrote:
> >>
> >>> Thanks Alieh,
> >>>
> >>> I am +1 (binding).
> >>>
> >>> However, although we agreed on not specifying an order of the results
> by
> >>> default, there is still the following  sentence in the KIP:
> >>>
> >>> "The order of the returned records is by default ascending by
> timestamp.
> >>> The method withDescendingTimestamps() can reverse the order. Btw,
> >>> withAscendingTimestamps() method can be used for code readability
> >>> purpose. "
> >>>
> >>> Could you please change it and also fix what Guozhang commented?
> >>>
> >>> Best,
> >>> Bruno
> >>>
> >>> On 11/19/23 2:12 AM, Guozhang Wang wrote:
>  Thanks Alieh,
> 
>  I read through the wiki page and the DISCUSS thread, all LGTM except a
>  minor thing in javadoc:
> 
>  "The query returns the records with a global ascending order of keys.
>  The records with the same key are ordered based on their insertion
>  timestamp in ascending order. Both the global and partial ordering are
>  modifiable with the corresponding methods defined for the class."
> 
>  Since this KIP is only for a single key, there's no key ordering but
>  only timestamp ordering right? Maybe the javadoc can be updated
>  accordingly.
> 
>  Otherwise, LGTM.
> 
>  On Fri, Nov 17, 2023 at 2:36 AM Alieh Saeedi
>   wrote:
> >
> > Hi all,
> > Following my recent message in the discussion thread, I am opening
> the
> > voting for KIP-968. Thanks for your votes in advance.
> >
> > Cheers,
> > Alieh
> >>>
> >>
> >
>
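For readers following the thread, the agreed outcome can be summarized with a small sketch (mirroring the API proposed in KIP-968 as described above; this is not a released Kafka Streams interface): the default is ANY, i.e. no ordering guarantee, and the query exposes the user-specified order through resultOrder().

public enum ResultOrder {
    ANY,          // default: no guarantee; results may still happen to be ordered
    ASCENDING,
    DESCENDING
}

// On the proposed query class, isAscending() is replaced by an accessor of the form:
// public ResultOrder resultOrder();   // returns the user-specified order, or ANY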


Re: [DISCUSS] KIP-968: Support single-key_multi-timestamp interactive queries (IQv2) for versioned state stores

2023-11-21 Thread Alieh Saeedi
Yes Matthias,
Based on the discussion we had, it has now been changed to Optional and the
default is empty (for the latest). Also, the `validTo()` method returns an
Optional.

Cheers,
Alieh
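To illustrate the semantics settled here with a standalone sketch (these are not the actual KIP-968 classes; the helper below is purely hypothetical): an empty Optional marks the still-valid "latest" version, which avoids the MAX_VALUE / -1 sentinels discussed further down the thread.

import java.time.Instant;
import java.util.Optional;

final class ValidToSemantics {
    // Hypothetical helper showing how a caller would interpret an Optional-valued validTo.
    static String describe(Optional<Instant> validTo) {
        return validTo.map(ts -> "valid until " + ts)
                      .orElse("latest version (no valid-to timestamp)");
    }

    public static void main(String[] args) {
        System.out.println(describe(Optional.empty()));            // the latest version
        System.out.println(describe(Optional.of(Instant.now())));  // a superseded version
    }
}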

On Mon, Nov 20, 2023 at 7:38 PM Matthias J. Sax  wrote:

> I think we should also discuss a little more about `validTo()` method?
>
> Given that "latest" version does not have a valid-to TS, should we
> change the return type to `Optional` and return `empty()` for "latest"?
>
> ATM the KIP uses `MAX_VALUE` for "latest" what seems to be less clean?
> We could also use `-1` (unknown), but both might be less expressive than
> `Optional`?
>
>
> -Matthias
>
> On 11/20/23 1:59 AM, Bruno Cadonna wrote:
> > Hi Alieh,
> >
> > Although, I've already voted, I found a minor miss. You should also add
> > a method isDescending() since the results could also be unordered now
> > that we agreed that the results are unordered by default. If both --
> > isDescending() and isAscending -- are false neither
> > withDescendingTimestamps() nor withAscendingTimestamps() was called.
> >
> > Best,
> > Bruno
> >
> > On 11/17/23 11:25 AM, Alieh Saeedi wrote:
> >> Hi all,
> >> Thank you for the feedback.
> >>
> >> So we agreed on no default ordering for keys and TSs. So I must provide
> >> both withAscendingXx() and withDescendingXx() for the class.
> >> Apart from that, I think we can either remove the existing constructor
> >> for
> >> the `VersionedRecord` class or follow the `Optional` thing.
> >>
> >> Since many hidden aspects of the KIP are quite clear now and we have
> come
> >> to a consensus about them, I think it's time to vote ;-)
> >> I look forward to your votes. Thanks a lot.
> >>
> >> Cheers,
> >> Alieh
> >>
> >> On Fri, Nov 17, 2023 at 2:27 AM Matthias J. Sax 
> wrote:
> >>
> >>> Thanks, Alieh.
> >>>
> >>> Overall SGTM. About `validTo` -- wondering if we should make it an
> >>> `Optional` and set to `empty()` by default?
> >>>
> >>> I am totally ok with going with the 3-way option about ordering using
> >>> default "undefined". For this KIP (as it's all net new) nothing really
> >>> changes. -- However, we should amend `RangeQuery`/KIP-985 to align it.
> >>>
> >>> Btw: so far we focused on key-ordering, but I believe the same
> "ordering
> >>> undefined by default" would apply to time-ordering, too? This might
> >>> affect KIP-997, too.
> >>>
> >>>
> >>> -Matthias
> >>>
> >>> On 11/16/23 12:51 AM, Bruno Cadonna wrote:
>  Hi,
> 
>  80)
>  We do not keep backwards compatibility with IQv1, right? I would even
>  say that currently we do not need to keep backwards compatibility
> among
>  IQv2 versions since we marked the API "Evolving" (do we only mean code
>  compatibility here or also behavioral compatibility?). I propose to
> try
>  to not limit ourselves for backwards compatibility that we explicitly
>  marked as evolving.
>  I re-read the discussion on KIP-985. In that discussion, we were quite
>  focused on what the state store provides. I see that for range
> queries,
>  we have methods on the state store interface that specify the order,
>  but
>  that should be kind of orthogonal to the IQv2 query type. Let's assume
>  somebody in the future adds a state store implementation that is not
>  order based. To account for use cases where the order does not matter,
>  this person might also add a method to the state store interface that
>  does not guarantee any order. However, our range query type is
>  specified
>  to guarantee order by default. So we need to add something like
>  withNoOrder() to the query type to allow the use cases that does not
>  need order and has the better performance in IQ. That does not look
>  very
>  nice to me. Having the no-order-guaranteed option does not cost us
>  anything and it keeps the IQv2 interface flexible. I assume we want to
>  drop the Evolving annotation at some point.
>  Sorry for not having brought this up in the discussion about KIP-985.
> 
>  Best,
>  Bruno
> 
> 
> 
> 
> 
>  On 11/15/23 6:56 AM, Matthias J. Sax wrote:
> > Just catching up on this one.
> >
> >
> > 50) I am also in favor of setting `validTo` in VersionedRecord for
> > single-key single-ts lookup; it seems better to return the proper
> > timestamp. The timestamp is already in the store and it's cheap to
> > extract it and add to the result, and it might be valuable
> information
> > for the user. Not sure though if we should deprecate the existing
> > constructor though, because for "latest" it's convenient to have?
> >
> >
> > 60) Yes, I meant `VersionedRecord`. Sorry for the mixup.
> >
> >
> > 80) We did discuss this question on KIP-985 (maybe you missed it
> > Bruno). It's kinda tricky.
> >
> > Historically, it seems that IQv1, ie, the `ReadOnlyXxx` interfaces
> > provide a clear contract that `ra

[jira] [Created] (KAFKA-15868) KIP-951 - Leader discovery optimisations for the client

2023-11-21 Thread Mayank Shekhar Narula (Jira)
Mayank Shekhar Narula created KAFKA-15868:
-

 Summary: KIP-951 - Leader discovery optimisations for the client
 Key: KAFKA-15868
 URL: https://issues.apache.org/jira/browse/KAFKA-15868
 Project: Kafka
  Issue Type: Improvement
  Components: clients
Affects Versions: 3.7.0
Reporter: Mayank Shekhar Narula
Assignee: Mayank Shekhar Narula
 Fix For: 3.7.0


https://cwiki.apache.org/confluence/display/KAFKA/KIP-951%3A+Leader+discovery+optimisations+for+the+client



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15868) KIP-951 - Leader discovery optimisations for the client

2023-11-21 Thread Mayank Shekhar Narula (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Shekhar Narula resolved KAFKA-15868.
---
Resolution: Fixed

> KIP-951 - Leader discovery optimisations for the client
> ---
>
> Key: KAFKA-15868
> URL: https://issues.apache.org/jira/browse/KAFKA-15868
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients
>Affects Versions: 3.7.0
>Reporter: Mayank Shekhar Narula
>Assignee: Mayank Shekhar Narula
>Priority: Major
> Fix For: 3.7.0
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-951%3A+Leader+discovery+optimisations+for+the+client



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Apache Kafka 3.7.0 Release

2023-11-21 Thread Mayank Shekhar Narula
Hi Stan

Can you include KIP-951 in the 3.7 release plan? All PRs are merged to
trunk.

On Wed, Nov 15, 2023 at 4:05 PM Stanislav Kozlovski
 wrote:

> Friendly reminder to everybody that the KIP Freeze is *exactly 7 days away*
> - November 22.
>
> A KIP must be accepted by this date in order to be considered for this
> release. Note, any KIP that may not be implemented in time, or otherwise
> risks heavily destabilizing the release, should be deferred.
>
> Best,
> Stan
>
> On Fri, Nov 3, 2023 at 6:03 AM Sophie Blee-Goldman 
> wrote:
>
> > Looks great, thank you! +1
> >
> > On Thu, Nov 2, 2023 at 10:21 AM David Jacot  >
> > wrote:
> >
> > > +1 from me as well. Thanks, Stan!
> > >
> > > David
> > >
> > > On Thu, Nov 2, 2023 at 6:04 PM Ismael Juma  wrote:
> > >
> > > > Thanks Stanislav, +1
> > > >
> > > > Ismael
> > > >
> > > > On Thu, Nov 2, 2023 at 7:01 AM Stanislav Kozlovski
> > > >  wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Given the discussion here and the lack of any pushback, I have
> > changed
> > > > the
> > > > > dates of the release:
> > > > > - KIP Freeze - *November 22 *(moved 4 days later)
> > > > > - Feature Freeze - *December 6 *(moved 2 days earlier)
> > > > > - Code Freeze - *December 20*
> > > > >
> > > > > If anyone has any thoughts against this proposal - please let me
> > know!
> > > It
> > > > > would be good to settle on this early. These will be the dates
> we're
> > > > going
> > > > > with
> > > > >
> > > > > Best,
> > > > > Stanislav
> > > > >
> > > > > On Thu, Oct 26, 2023 at 12:15 AM Sophie Blee-Goldman <
> > > > > sop...@responsive.dev>
> > > > > wrote:
> > > > >
> > > > > > Thanks for the response and explanations -- I think the main
> > question
> > > > for
> > > > > > me
> > > > > > was whether we intended to permanently increase the KF -- FF gap
> > from
> > > > the
> > > > > > historical 1 week to 3 weeks? Maybe this was a conscious decision
> > > and I
> > > > > > just
> > > > > >  missed the memo, hopefully someone else can chime in here. I'm
> all
> > > for
> > > > > > additional though. And looking around at some of the recent
> > releases,
> > > > it
> > > > > > seems like we haven't been consistently following the "usual"
> > > schedule
> > > > > > since
> > > > > > the 2.x releases.
> > > > > >
> > > > > > Anyways, my main concern was making sure to leave a full 2 weeks
> > > > between
> > > > > > feature freeze and code freeze, so I'm generally happy with the
> new
> > > > > > proposal.
> > > > > > Although I would still prefer to have the KIP freeze fall on a
> > > > Wednesday
> > > > > --
> > > > > > Ismael actually brought up the same thing during the 3.5.0
> release
> > > > > > planning,
> > > > > > so I'll just refer to his explanation for this:
> > > > > >
> > > > > > We typically choose a Wednesday for the various freeze dates -
> > there
> > > > are
> > > > > > > often 1-2 day slips and it's better if that doesn't require
> > people
> > > > > > > working through the weekend.
> > > > > > >
> > > > > >
> > > > > > (From this mailing list thread
> > > > > > <
> https://lists.apache.org/thread/dv1rym2jkf0141sfsbkws8ckkzw7st5h
> > >)
> > > > > >
> > > > > > Thanks for driving the release!
> > > > > > Sophie
> > > > > >
> > > > > > On Wed, Oct 25, 2023 at 8:13 AM Stanislav Kozlovski
> > > > > >  wrote:
> > > > > >
> > > > > > > Thanks for the thorough response, Sophie.
> > > > > > >
> > > > > > > - Added to the "Future Release Plan"
> > > > > > >
> > > > > > > > 1. Why is the KIP freeze deadline on a Saturday?
> > > > > > >
> > > > > > > It was simply added as a starting point - around 30 days from
> the
> > > > > > > announcement. We can move it earlier to the 15th of November,
> but
> > > my
> > > > > > > thinking is later is better with these things - it's already
> > > > aggressive
> > > > > > > enough. e.g given the choice of Nov 15 vs Nov 18, I don't
> > > necessarily
> > > > > > see a
> > > > > > > strong reason to choose 15.
> > > > > > >
> > > > > > > If people feel strongly about this, to make up for this, we can
> > eat
> > > > > into
> > > > > > > the KF-FF time as I'll touch upon later, and move FF a few days
> > > > earlier
> > > > > > to
> > > > > > > land on a Wednesday.
> > > > > > >
> > > > > > > This reduces the time one has to get their feature complete
> after
> > > KF,
> > > > > but
> > > > > > > allows for longer time to a KIP accepted, so the KF-FF gap can
> be
> > > > made
> > > > > up
> > > > > > > when developing the feature in parallel.
> > > > > > >
> > > > > > > > , this makes it easy for everyone to remember when the next
> > > > deadline
> > > > > is
> > > > > > > so they can make sure to get everything in on time. I worry
> that
> > > > > varying
> > > > > > > this will catch people off guard.
> > > > > > >
> > > > > > > I don't see much value in optimizing the dates for ease of
> > memory -
> > > > > > besides
> > > > > > > the KIP Freeze (which is the base date), there are only two
> more
> > > > dates
> > > > > to

[jira] [Created] (KAFKA-15869) Document semantics of nullable nested API entities

2023-11-21 Thread Anton Agestam (Jira)
Anton Agestam created KAFKA-15869:
-

 Summary: Document semantics of nullable nested API entities
 Key: KAFKA-15869
 URL: https://issues.apache.org/jira/browse/KAFKA-15869
 Project: Kafka
  Issue Type: Wish
Reporter: Anton Agestam


The initial version of ConsumerGroupHeartbeatResponse [introduced the first 
field across the protocol that is a nullable nested 
entity|https://github.com/dajac/kafka/blob/3acd87a3e82e1d2fd4c07218d362e7665b99c547/clients/src/main/resources/common/message/ConsumerGroupHeartbeatResponse.json#L48].
 As the implementor of a third-party schema parser it is not clear how to 
handle this field, where such fields are allowed, and how null is represented 
for such fields.

As far as I can tell, the [protocol 
guide|https://kafka.apache.org/protocol.html#The_Messages_ConsumerGroupHeartbeat]
 does not mention the nullability at all.

The reason I ask where such fields are allowed is that, if null is represented 
here simply by omitting any bytes, then I suspect the only unambiguous place for 
such a field to appear would be as the last field of a 
top-level entity. Even then, how is it discriminated from tagged fields?

Is it possible this field was made nullable by mistake?

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Road to Kafka 4.0

2023-11-21 Thread Josep Prat
Hi Colin,

I think it's great that Confluent runs KRaft clusters in production, and it 
means that it is production ready for Confluent and its users. But luckily for 
Kafka, the community is bigger than this (self managed in the cloud or on-prem, 
or customers of other SaaS companies).
We've heard from at least one SaaS company, Aiven (disclaimer, it is my employer), 
where the current feature set makes it non-trivial to migrate. This same issue 
might happen not only at Aiven but with any user of Kafka who uses immutable 
infrastructure. Another case is users that have hundreds (or more) of 
clusters and more than 100k nodes, who experience node failures multiple times 
during a single day. In this situation, not having KIP-853 makes these power 
users unable to join the game, as introducing a new error-prone manual (or 
to-be-automated) operation is usually a huge no-go.

But I hear the concerns about delaying 4.0 for another 3 to 4 months. Would it 
help if we aimed at shortening the timeline for 3.8.0 and started with 
4.0.0 a bit earlier?
Maybe we could work on 3.8.0 almost in parallel with 4.0.0:
- Start with 3.8.0 release process
- After a small time (let's say a week) create the release branch
- Start with 4.0.0 release process as usual
- Cherry pick KRaft related issues to 3.8.0
- Release 3.8.0
I suspect 4.0.0 will need a bit more time than usual to ensure the code is 
cleaned up of deprecated classes and methods on top of the usual work we have. 
For this reason I think there would be enough time between releasing 3.8.0 and 
4.0.0.

What do you all think?

Best,
Josep Prat

On 2023/11/20 20:03:18 Colin McCabe wrote:
> Hi Josep,
> 
> I think there is some confusion here. Quorum reconfiguration is not needed 
> for KRaft to become production ready. Confluent runs thousands of KRaft 
> clusters without quorum reconfiguration, and has for years. While dynamic 
> quorum reconfiguration is a nice feature, it doesn't block anything: not 
> migration, not deployment. As best as I understand it, the use-case Aiven has 
> isn't even reconfiguration per se, just wiping a disk. There are ways to 
> handle this -- I discussed some earlier in the thread. I think it would be 
> productive to continue that discussion -- especially the part around 
> documentation and testing of these cases.
> 
> A lot of people have done a lot of work to get Kafka 4.0 ready. I would not 
> want to delay that because we want an additional feature. And we will always 
> want additional features. So I am concerned we will end up in an infinite 
> loop of people asking for "just one more feature" before they migrate.
> 
> best,
> Colin
> 
> 
> On Mon, Nov 20, 2023, at 04:15, Josep Prat wrote:
> > Hi all,
> >
> > I wanted to share my opinion regarding this topic. I know some 
> > discussions happened some time ago (over a year) but I believe it's 
> > wise to reflect and re-evaluate if those decisions are still valid.
> > KRaft, as of Kafka 3.6.x and 3.7.x, does not yet have feature parity with 
> > Zookeeper. By dropping Zookeeper altogether before achieving such 
> > parity, we are opening the door to leaving a chunk of Apache Kafka 
> > users without an easy way to upgrade to 4.0.
> > In favor of making upgrades as smooth as possible, I propose to have a 
> > Kafka version where KIP-853 is merged and Zookeeper still is supported. 
> > This will enable community members who can't migrate yet to KRaft to do 
> > so in a safe way (rolling back if something goes wrong). Additionally, 
> > this will give us more confidence on having KRaft replacing 
> > successfully Zookeeper without any big problems by discovering and 
> > fixing bugs or by confirming that KRaft works as expected.
> > For this I strongly believe we should have a 3.8.x version before 4.0.x.
> >
> > What do other think in this regard?
> >
> > Best,
> >
> > On 2023/11/14 20:47:10 Colin McCabe wrote:
> >> On Tue, Nov 14, 2023, at 04:37, Anton Agestam wrote:
> >> > Hi Colin,
> >> >
> >> > Thank you for your thoughtful and comprehensive response.
> >> >
> >> >> KIP-853 is not a blocker for either 3.7 or 4.0. We discussed this in
> >> >> several KIPs that happened this year and last year. The most notable was
> >> >> probably KIP-866, which was approved in May 2022.
> >> >
> >> > I understand this is the case, I'm raising my concern because I was
> >> > foreseeing some major pain points as a consequence of this decision. Just
> >> > to make it clear though: I am not asking for anyone to do work for me, 
> >> > and
> >> > I understand the limitations of resources available to implement 
> >> > features.
> >> > What I was asking is rather to consider the implications of _removing_
> >> > features before there exists a replacement for them.
> >> >
> >> > I understand that the timeframe for 3.7 isn't feasible, and because of 
> >> > that
> >> > I think what I was asking is rather: can we make sure that there are more
> >> > 3.x releases until controller quorum online resizing i

Re: How Kafka handle partition leader change?

2023-11-21 Thread Andrew Grant
Hey De Gao,

Message loss or duplication can actually happen even without a leadership 
change for a partition. For example if there are network issues and the 
producer never gets the ack from the server, it’ll retry and cause duplicates. 
Message loss can usually occur when you use the acks=1 config - mostly you’d lose 
messages after a leadership change, but in theory if the leader was restarted, the page 
cache was lost and it became leader again, we could lose the message if it 
wasn’t replicated soon enough. 

You might be right it’s more likely to occur during leadership change though - 
not 100% sure myself on that. 

Point being, the idempotent producer really is the way to write once and only 
once as far as I’m aware. 

If you have any suggestions for improvements I’m sure the community would love 
to hear them! It’s possible there are ways to make leadership changes more 
seamless and at least reduce the probability of duplicates or loss. Not sure 
myself. I’ve wondered before if the older leader could reroute messages for a 
small period of time until the client knew the new leader for example. 

Andrew 

Sent from my iPhone
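For anyone who wants to try the idempotent producer mentioned above, here is a minimal sketch (the bootstrap server and topic name are placeholders; this covers only the producer side, not transactions):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class IdempotentProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Idempotence lets the broker de-duplicate retried sends; it implies acks=all
        // and a bounded max.in.flight.requests.per.connection.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"));  // placeholder topic
            producer.flush();
        }
    }
}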

> On Nov 21, 2023, at 1:42 AM, De Gao  wrote:
> 
> I am asking this because I want to propose a change to Kafka. But it looks like 
> in certain scenarios it is very hard to avoid message loss or duplication. 
> I wonder in what scenarios we can accept that and where to draw the line?
> 
> 
> From: De Gao 
> Sent: 21 November 2023 6:25
> To: dev@kafka.apache.org 
> Subject: Re: How Kafka handle partition leader change?
> 
> Thanks Andrew.  Sounds like the leadership change on the Kafka side is a 'best 
> effort' to avoid message duplication or loss. Can we say that message loss is 
> very likely during a leadership change unless the producer uses idempotency? Is 
> this a general situation where there is no intent to provide a data integrity 
> guarantee upon metadata change?
> 
> From: Andrew Grant 
> Sent: 20 November 2023 12:26
> To: dev@kafka.apache.org 
> Subject: Re: How Kafka handle partition leader change?
> 
> Hey De Gao,
> 
> The controller is the one that always elects a new leader. When that happens 
> that metadata is changed on the controller and once committed it’s broadcast 
> to all brokers in the cluster. In KRaft this would be via a PartitonChange 
> record that each broker will fetch from the controller. In ZK it’d be via an 
> RPC from the controller to the broker.
> 
> In either case each broker might get the notification at a different time. No 
> ordering guarantee among the brokers. But eventually they’ll all know the new 
> leader which means eventually the Produce will fail with NotLeader and the 
> client will refresh its metadata and find out the new one.
> 
> In between all that leadership movement, there are various ways messages can 
> get duplicated or lost. However if you use the idempotent producer I believe 
> you actually won’t see dupes or missing messages so if that’s an important 
> requirement you could look into that. The producer is designed to retry in 
> general and when you use the idempotent producer some extra metadata is sent 
> around to dedupe any messages server-side that were sent multiple times by 
> the client.
> 
> If you’re interested in learning more Kafka internals I highly recommend this 
> blog series 
> https://www.confluent.io/blog/apache-kafka-architecture-and-internals-by-jun-rao/
> 
> Hope that helped a bit.
> 
> Andy
> 
> Sent from my iPhone
> 
>> On Nov 20, 2023, at 2:07 AM, De Gao  wrote:
>> 
>> Hi all, I have an interesting question here.
>> 
>> Let's say we have 2 brokers B1 and B2, controller C, and producers P1, P2...Pn. 
>> Currently B1 holds the partition leadership and Px is constantly producing 
>> messages to B1. We want to move the partition leadership to B2. How is the 
>> leadership change synced between B1, B2, C, and Px such that it is guaranteed 
>> that all the parties acknowledge the leadership change in the right order? 
>> Is there a break in the produce flow in between? Any chance of message loss?
>> 
>> Thanks
>> 
>> De Gao


Re: Requesting permissions to contribute to Apache Kafka

2023-11-21 Thread Josep Prat
Hi Ria,

You are now set. Thanks for your interest in Apache Kafka!

Best,

On Mon, Nov 20, 2023 at 5:48 PM Ria Pradeep (BLOOMBERG/ 919 3RD A) <
rprade...@bloomberg.net> wrote:

> I would like to request permission to contribute to Apache Kafka.
>
> wiki ID: rpradeep
> JIRA ID: rpradeep
>
> Thanks,
> Ria



-- 

*Josep Prat*
Open Source Engineering Director, *Aiven*
josep.p...@aiven.io   |   +491715557497
aiven.io
*Aiven Deutschland GmbH*
Alexanderufer 3-7, 10117 Berlin
Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
Amtsgericht Charlottenburg, HRB 209739 B


Re: [VOTE] KIP-968: Support single-key_multi-timestamp interactive queries (IQv2) for versioned state stores

2023-11-21 Thread Lucas Brutschy
Hi Alieh,

thanks for the KIP!

+1 (binding)

Lucas

On Tue, Nov 21, 2023 at 11:26 AM Alieh Saeedi
 wrote:
>
> Thanks, Matthias; I changed it to `ANY` which is the shortest and not
> misleading.
>
> Cheers,
> Alieh
>
> On Mon, Nov 20, 2023 at 7:42 PM Matthias J. Sax  wrote:
>
> > Adding an enum is a good idea!
> >
> > Wondering if `UNORDERED` is the best name? Want to avoid bike shedding,
> > just asking.
> >
> > We could also use `UNDEFINED` / `UNSPECIFIED` / `NONE` / `ANY` ?
> >
> > In the end, the result _might_ be ordered, we just don't guarantee any
> > order.
> >
> >
> > -Matthias
> >
> > On 11/20/23 9:17 AM, Alieh Saeedi wrote:
> > > Hi all,
> > > I added the public enum `ResultOrder` to the KIP which helps with keeping
> > > three values (unordered, ascending, and descending) for the query
> > results.
> > > Therefore the method `isAscending()` is changed to `resultOrder()` which
> > > returns either the user-specified result order or `unordered`.
> > > Cheers,
> > > Alieh
> > >
> > > On Mon, Nov 20, 2023 at 1:40 PM Alieh Saeedi 
> > wrote:
> > >
> > >> Thank you, Guozhag and Bruno, for reviewing the KIP and reading the
> > whole
> > >> discussion thread. I appreciate your help:)
> > >> The KIP is now corrected and updated.
> > >>
> > >> Cheers,
> > >> Alieh
> > >>
> > >> On Mon, Nov 20, 2023 at 10:43 AM Bruno Cadonna 
> > wrote:
> > >>
> > >>> Thanks Alieh,
> > >>>
> > >>> I am +1 (binding).
> > >>>
> > >>> However, although we agreed on not specifying an order of the results
> > by
> > >>> default, there is still the following  sentence in the KIP:
> > >>>
> > >>> "The order of the returned records is by default ascending by
> > timestamp.
> > >>> The method withDescendingTimestamps() can reverse the order. Btw,
> > >>> withAscendingTimestamps() method can be used for code readability
> > >>> purpose. "
> > >>>
> > >>> Could you please change it and also fix what Guozhang commented?
> > >>>
> > >>> Best,
> > >>> Bruno
> > >>>
> > >>> On 11/19/23 2:12 AM, Guozhang Wang wrote:
> >  Thanks Alieh,
> > 
> >  I read through the wiki page and the DISCUSS thread, all LGTM except a
> >  minor thing in javadoc:
> > 
> >  "The query returns the records with a global ascending order of keys.
> >  The records with the same key are ordered based on their insertion
> >  timestamp in ascending order. Both the global and partial ordering are
> >  modifiable with the corresponding methods defined for the class."
> > 
> >  Since this KIP is only for a single key, there's no key ordering but
> >  only timestamp ordering right? Maybe the javadoc can be updated
> >  accordingly.
> > 
> >  Otherwise, LGTM.
> > 
> >  On Fri, Nov 17, 2023 at 2:36 AM Alieh Saeedi
> >   wrote:
> > >
> > > Hi all,
> > > Following my recent message in the discussion thread, I am opening
> > the
> > > voting for KIP-968. Thanks for your votes in advance.
> > >
> > > Cheers,
> > > Alieh
> > >>>
> > >>
> > >
> >


Re: [DISCUSS] KIP-1005: Add EarliestLocalOffset to GetOffsetShell

2023-11-21 Thread Christo Lolov
Heya!

Thanks a lot for this. I have updated the KIP to include exposing the
tiered-offset as well. Let me know whether the Public Interfaces section
needs more explanations regarding the changes needed to the OffsetSpec or
others.

Best,
Christo
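For context, a sketch of how offsets are fetched through the Admin client today (the bootstrap server and topic are placeholders; the earliestLocal() factory in the comment is the kind of OffsetSpec addition the KIP proposes and does not exist in the current API):

import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.common.TopicPartition;

public class ListOffsetsExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
        TopicPartition tp = new TopicPartition("my-topic", 0);                    // placeholder

        try (Admin admin = Admin.create(props)) {
            // Existing specs include earliest(), latest(), forTimestamp() and maxTimestamp().
            ListOffsetsResult earliest = admin.listOffsets(Map.of(tp, OffsetSpec.earliest()));
            System.out.println("log start offset: " + earliest.partitionResult(tp).get().offset());

            // Proposed by the KIP (hypothetical factory name, not yet in the API):
            // admin.listOffsets(Map.of(tp, OffsetSpec.earliestLocal()));
        }
    }
}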

On Tue, 21 Nov 2023 at 04:20, Satish Duggana 
wrote:

> Thanks Christo for starting the discussion on the KIP.
>
> As mentioned in KAFKA-15857[1], the goal is to add new entries for
> local-log-start-offset and tiered-offset in OffsetSpec. This will be
> used in AdminClient APIs and also to be added as part of
> GetOffsetShell. This was also raised by Kamal in the earlier email.
>
> OffsetSpec related changes for these entries also need to be mentioned
> as part of the PublicInterfaces section because these are exposed to
> users as public APIs through Admin#listOffsets() APIs[2, 3].
>
> Please update the KIP with the above details.
>
> 1. https://issues.apache.org/jira/browse/KAFKA-15857
> 2.
> https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/admin/Admin.java#L1238
> 3.
> https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/admin/Admin.java#L1226
>
> ~Satish.
>
> On Mon, 20 Nov 2023 at 18:35, Kamal Chandraprakash
>  wrote:
> >
> > Hi Christo,
> >
> > Thanks for the KIP!
> >
> > Similar to the earliest-local-log offset, can we also expose the
> > highest-copied-remote-offset via the
> > GetOffsetShell tool? This will be useful during debugging sessions.
> >
> >
> > On Mon, Nov 20, 2023 at 5:38 PM Christo Lolov 
> > wrote:
> >
> > > Hello all!
> > >
> > > I would like to start a discussion for
> > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1005%3A+Add+EarliestLocalOffset+to+GetOffsetShell
> > > .
> > >
> > > A new offset called local log start offset was introduced as part of
> > > KIP-405: Kafka Tiered Storage. KIP-1005 aims to expose this offset by
> > > changing the AdminClient and in particular the GetOffsetShell tool.
> > >
> > > I am looking forward to your suggestions for improvement!
> > >
> > > Best,
> > > Christo
> > >
>


Jenkins build is still unstable: Kafka » Kafka Branch Builder » trunk #2403

2023-11-21 Thread Apache Jenkins Server
See 




Build failed in Jenkins: Kafka » Kafka Branch Builder » 3.5 #99

2023-11-21 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 470584 lines...]

Gradle Test Run :streams:integrationTest > Gradle Test Executor 186 > 
TableTableJoinIntegrationTest > [caching enabled = true] > 
org.apache.kafka.streams.integration.TableTableJoinIntegrationTest.testInner[caching
 enabled = true] PASSED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 186 > 
TableTableJoinIntegrationTest > [caching enabled = true] > 
org.apache.kafka.streams.integration.TableTableJoinIntegrationTest.testOuter[caching
 enabled = true] STARTED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 186 > 
TableTableJoinIntegrationTest > [caching enabled = true] > 
org.apache.kafka.streams.integration.TableTableJoinIntegrationTest.testOuter[caching
 enabled = true] PASSED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 186 > 
TableTableJoinIntegrationTest > [caching enabled = true] > 
org.apache.kafka.streams.integration.TableTableJoinIntegrationTest.testInnerWithVersionedStores[caching
 enabled = true] STARTED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 186 > 
TableTableJoinIntegrationTest > [caching enabled = true] > 
org.apache.kafka.streams.integration.TableTableJoinIntegrationTest.testInnerWithVersionedStores[caching
 enabled = true] PASSED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 186 > 
TableTableJoinIntegrationTest > [caching enabled = true] > 
org.apache.kafka.streams.integration.TableTableJoinIntegrationTest.testLeft[caching
 enabled = true] STARTED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 186 > 
TableTableJoinIntegrationTest > [caching enabled = true] > 
org.apache.kafka.streams.integration.TableTableJoinIntegrationTest.testLeft[caching
 enabled = true] PASSED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 186 > 
TableTableJoinIntegrationTest > [caching enabled = true] > 
org.apache.kafka.streams.integration.TableTableJoinIntegrationTest.testOuterWithVersionedStores[caching
 enabled = true] STARTED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 186 > 
TableTableJoinIntegrationTest > [caching enabled = true] > 
org.apache.kafka.streams.integration.TableTableJoinIntegrationTest.testOuterWithVersionedStores[caching
 enabled = true] PASSED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 186 > 
TableTableJoinIntegrationTest > [caching enabled = true] > 
org.apache.kafka.streams.integration.TableTableJoinIntegrationTest.testOuterWithRightVersionedOnly[caching
 enabled = true] STARTED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 186 > 
TableTableJoinIntegrationTest > [caching enabled = true] > 
org.apache.kafka.streams.integration.TableTableJoinIntegrationTest.testOuterWithRightVersionedOnly[caching
 enabled = true] PASSED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 186 > 
TableTableJoinIntegrationTest > [caching enabled = true] > 
org.apache.kafka.streams.integration.TableTableJoinIntegrationTest.testLeftWithVersionedStores[caching
 enabled = true] STARTED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 186 > 
TableTableJoinIntegrationTest > [caching enabled = true] > 
org.apache.kafka.streams.integration.TableTableJoinIntegrationTest.testLeftWithVersionedStores[caching
 enabled = true] PASSED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 186 > 
TableTableJoinIntegrationTest > [caching enabled = true] > 
org.apache.kafka.streams.integration.TableTableJoinIntegrationTest.testOuterWithLeftVersionedOnly[caching
 enabled = true] STARTED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 186 > 
TableTableJoinIntegrationTest > [caching enabled = true] > 
org.apache.kafka.streams.integration.TableTableJoinIntegrationTest.testOuterWithLeftVersionedOnly[caching
 enabled = true] PASSED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 186 > 
TableTableJoinIntegrationTest > [caching enabled = true] > 
org.apache.kafka.streams.integration.TableTableJoinIntegrationTest.testLeftWithRightVersionedOnly[caching
 enabled = true] STARTED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 186 > 
TableTableJoinIntegrationTest > [caching enabled = true] > 
org.apache.kafka.streams.integration.TableTableJoinIntegrationTest.testLeftWithRightVersionedOnly[caching
 enabled = true] PASSED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 186 > 
TableTableJoinIntegrationTest > [caching enabled = true] > 
org.apache.kafka.streams.integration.TableTableJoinIntegrationTest.testInnerInner[caching
 enabled = true] STARTED

Gradle Test Run :streams:integrationTest > Gradle Test Executor 186 > 
TableTableJoinIntegrationTest > [caching enabled = true] > 
org.apache.kafka.streams.integration.TableTableJoinIntegrationTest.testInnerInner[caching
 enabled = true] 

[jira] [Created] (KAFKA-15870) Move new group coordinator metrics from Yammer to Metrics

2023-11-21 Thread Jeff Kim (Jira)
Jeff Kim created KAFKA-15870:


 Summary: Move new group coordinator metrics from Yammer to Metrics
 Key: KAFKA-15870
 URL: https://issues.apache.org/jira/browse/KAFKA-15870
 Project: Kafka
  Issue Type: Sub-task
Reporter: Jeff Kim
Assignee: Jeff Kim






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15837) Throw error on use of Consumer.poll(long timeout)

2023-11-21 Thread Andrew Schofield (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Schofield resolved KAFKA-15837.
--
Resolution: Fixed

> Throw error on use of Consumer.poll(long timeout)
> -
>
> Key: KAFKA-15837
> URL: https://issues.apache.org/jira/browse/KAFKA-15837
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients, consumer
>Reporter: Kirk True
>Assignee: Andrew Schofield
>Priority: Major
> Fix For: 3.7.0
>
>
> Per [KIP-266|https://cwiki.apache.org/confluence/x/5kiHB], the 
> Consumer.poll(long timeout) method was deprecated back in 2.0.0. The method 
> will now throw a KafkaException.
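For anyone migrating off the deprecated overload, the replacement is the Duration-based poll() that has been available since 2.0.0 (sketch below; per KIP-266 the Duration overload also bounds time spent waiting for metadata):

import java.time.Duration;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecords;

final class PollMigration {
    // Before (deprecated since 2.0.0, now throws): consumer.poll(1000L);
    // After: the Duration overload.
    static <K, V> ConsumerRecords<K, V> pollOnce(Consumer<K, V> consumer) {
        return consumer.poll(Duration.ofSeconds(1));
    }
}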



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15871) Implement kafka-client-metrics.sh tool

2023-11-21 Thread Andrew Schofield (Jira)
Andrew Schofield created KAFKA-15871:


 Summary: Implement kafka-client-metrics.sh tool
 Key: KAFKA-15871
 URL: https://issues.apache.org/jira/browse/KAFKA-15871
 Project: Kafka
  Issue Type: Sub-task
  Components: admin
Affects Versions: 3.7.0
Reporter: Andrew Schofield
Assignee: Andrew Schofield
 Fix For: 3.7.0


Implement the `kafka-client-metrics.sh` tool which is introduced in KIP-714 and 
enhanced in KIP-1000.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-968: Support single-key_multi-timestamp interactive queries (IQv2) for versioned state stores

2023-11-21 Thread Matthias J. Sax

Thanks! SGTM.

Seems all open questions are resolved. Thanks for pushing this through!

-Matthias

On 11/21/23 2:29 AM, Alieh Saeedi wrote:

Yes Matthias,
Based on the discussion we had, it has now been changed to Optional and the
default is empty (for the latest). Also, the `validTo()` method returns an
Optional.

Cheers,
Alieh

On Mon, Nov 20, 2023 at 7:38 PM Matthias J. Sax  wrote:


I think we should also discuss a little more about `validTo()` method?

Given that "latest" version does not have a valid-to TS, should we
change the return type to `Optional` and return `empty()` for "latest"?

ATM the KIP uses `MAX_VALUE` for "latest" what seems to be less clean?
We could also use `-1` (unknown), but both might be less expressive than
`Optional`?


-Matthias

On 11/20/23 1:59 AM, Bruno Cadonna wrote:

Hi Alieh,

Although, I've already voted, I found a minor miss. You should also add
a method isDescending() since the results could also be unordered now
that we agreed that the results are unordered by default. If both --
isDescending() and isAscending -- are false neither
withDescendingTimestamps() nor withAscendingTimestamps() was called.

Best,
Bruno

On 11/17/23 11:25 AM, Alieh Saeedi wrote:

Hi all,
Thank you for the feedback.

So we agreed on no default ordering for keys and TSs. So I must provide
both withAscendingXx() and withDescendingXx() for the class.
Apart from that, I think we can either remove the existing constructor
for
the `VersionedRecord` class or follow the `Optional` thing.

Since many hidden aspects of the KIP are quite clear now and we have

come

to a consensus about them, I think it's time to vote ;-)
I look forward to your votes. Thanks a lot.

Cheers,
Alieh

On Fri, Nov 17, 2023 at 2:27 AM Matthias J. Sax 

wrote:



Thanks, Alieh.

Overall SGTM. About `validTo` -- wondering if we should make it an
`Optional` and set to `empty()` by default?

I am totally ok with going with the 3-way option about ordering using
default "undefined". For this KIP (as it's all net new) nothing really
changes. -- However, we should amend `RangeQuery`/KIP-985 to align it.

Btw: so far we focused on key-ordering, but I believe the same

"ordering

undefined by default" would apply to time-ordering, too? This might
affect KIP-997, too.


-Matthias

On 11/16/23 12:51 AM, Bruno Cadonna wrote:

Hi,

80)
We do not keep backwards compatibility with IQv1, right? I would even
say that currently we do not need to keep backwards compatibility

among

IQv2 versions since we marked the API "Evolving" (do we only mean code
compatibility here or also behavioral compatibility?). I propose to

try

to not limit ourselves for backwards compatibility that we explicitly
marked as evolving.
I re-read the discussion on KIP-985. In that discussion, we were quite
focused on what the state store provides. I see that for range

queries,

we have methods on the state store interface that specify the order,
but
that should be kind of orthogonal to the IQv2 query type. Let's assume
somebody in the future adds a state store implementation that is not
order based. To account for use cases where the order does not matter,
this person might also add a method to the state store interface that
does not guarantee any order. However, our range query type is
specified
to guarantee order by default. So we need to add something like
withNoOrder() to the query type to allow the use cases that does not
need order and has the better performance in IQ. That does not look
very
nice to me. Having the no-order-guaranteed option does not cost us
anything and it keeps the IQv2 interface flexible. I assume we want to
drop the Evolving annotation at some point.
Sorry for not having brought this up in the discussion about KIP-985.

Best,
Bruno





On 11/15/23 6:56 AM, Matthias J. Sax wrote:

Just catching up on this one.


50) I am also in favor of setting `validTo` in VersionedRecord for
single-key single-ts lookup; it seems better to return the proper
timestamp. The timestamp is already in the store and it's cheap to
extract it and add to the result, and it might be valuable

information

for the user. Not sure though if we should deprecate the existing
constructor though, because for "latest" it's convenient to have?


60) Yes, I meant `VersionedRecord`. Sorry for the mixup.


80) We did discuss this question on KIP-985 (maybe you missed it
Bruno). It's kinda tricky.

Historically, it seems that IQv1, ie, the `ReadOnlyXxx` interfaces
provide a clear contract that `range()` is ascending and
`reverseRange()` is descending.

For `RangeQuery`, the question is, if we did implicitly inherit this
contract? Our conclusion on KIP-985 discussion was, that we did
inherit it. If this holds true, changing the contract would be a
breaking change (what might still be acceptable, given that the
interface is annotated as unstable, and that IQv2 is not widely
adopted yet). I am happy to go with the 3-option contract, but just
want to ensure we all agree it's the r

Re: [VOTE] KIP-968: Support single-key_multi-timestamp interactive queries (IQv2) for versioned state stores

2023-11-21 Thread Matthias J. Sax

+1 (binding)

On 11/21/23 4:52 AM, Lucas Brutschy wrote:

Hi Alieh,

thanks for the KIP!

+1 (binding)

Lucas

On Tue, Nov 21, 2023 at 11:26 AM Alieh Saeedi
 wrote:


Thanks, Matthias; I changed it to `ANY` which is the shortest and not
misleading.

Cheers,
Alieh

On Mon, Nov 20, 2023 at 7:42 PM Matthias J. Sax  wrote:


Adding an enum is a good idea!

Wondering if `UNORDERED` is the best name? Want to avoid bike shedding,
just asking.

We could also use `UNDEFINED` / `UNSPECIFIED` / `NONE` / `ANY` ?

In the end, the result _might_ be ordered, we just don't guarantee any
order.


-Matthias

On 11/20/23 9:17 AM, Alieh Saeedi wrote:

Hi all,
I added the public enum `ResultOrder` to the KIP which helps with keeping
three values (unordered, ascending, and descending) for the query

results.

Therefore the method `isAscending()` is changed to `resultOrder()` which
returns either the user-specified result order or `unordered`.
Cheers,
Alieh

On Mon, Nov 20, 2023 at 1:40 PM Alieh Saeedi 

wrote:



Thank you, Guozhag and Bruno, for reviewing the KIP and reading the

whole

discussion thread. I appreciate your help:)
The KIP is now corrected and updated.

Cheers,
Alieh

On Mon, Nov 20, 2023 at 10:43 AM Bruno Cadonna 

wrote:



Thanks Alieh,

I am +1 (binding).

However, although we agreed on not specifying an order of the results

by

default, there is still the following  sentence in the KIP:

"The order of the returned records is by default ascending by

timestamp.

The method withDescendingTimestamps() can reverse the order. Btw,
withAscendingTimestamps() method can be used for code readability
purpose. "

Could you please change it and also fix what Guozhang commented?

Best,
Bruno

On 11/19/23 2:12 AM, Guozhang Wang wrote:

Thanks Alieh,

I read through the wiki page and the DISCUSS thread, all LGTM except a
minor thing in javadoc:

"The query returns the records with a global ascending order of keys.
The records with the same key are ordered based on their insertion
timestamp in ascending order. Both the global and partial ordering are
modifiable with the corresponding methods defined for the class."

Since this KIP is only for a single key, there's no key ordering but
only timestamp ordering right? Maybe the javadoc can be updated
accordingly.

Otherwise, LGTM.

On Fri, Nov 17, 2023 at 2:36 AM Alieh Saeedi
 wrote:


Hi all,
Following my recent message in the discussion thread, I am opening

the

voting for KIP-968. Thanks for your votes in advance.

Cheers,
Alieh










[jira] [Created] (KAFKA-15872) Investigate autocommit retry logic

2023-11-21 Thread Philip Nee (Jira)
Philip Nee created KAFKA-15872:
--

 Summary: Investigate autocommit retry logic
 Key: KAFKA-15872
 URL: https://issues.apache.org/jira/browse/KAFKA-15872
 Project: Kafka
  Issue Type: Sub-task
Reporter: Philip Nee


This is purely an investigation ticket.

Currently, we send an autocommit only if there isn't an inflight one; however, 
this logic might not be correct because I think we should:
 # expire the request if it is not completed in time
 # always send an autocommit on the clock



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[DISCUSS] KIP-1008: ParKa - the Marriage of Parquet and Kafka

2023-11-21 Thread Xinli shang
Hi, all

Can I ask for a discussion on the newly created KIP-1008: ParKa - the
Marriage of Parquet and Kafka?

-- 
Xinli Shang


Jenkins build is still unstable: Kafka » Kafka Branch Builder » trunk #2404

2023-11-21 Thread Apache Jenkins Server
See 




Re: [VOTE] KIP-1001; CurrentControllerId Metric

2023-11-21 Thread José Armando García Sancio
LGTM. +1 binding.

On Mon, Nov 20, 2023 at 1:48 PM Jason Gustafson
 wrote:
>
> The KIP makes sense. +1
>
> On Mon, Nov 20, 2023 at 12:37 PM David Arthur
>  wrote:
>
> > Thanks Colin,
> >
> > +1 from me
> >
> > -David
> >
> > On Tue, Nov 14, 2023 at 3:53 PM Colin McCabe  wrote:
> >
> > > Hi all,
> > >
> > > I'd like to call a vote for KIP-1001: Add CurrentControllerId metric.
> > >
> > > Take a look here:
> > > https://cwiki.apache.org/confluence/x/egyZE
> > >
> > > best,
> > > Colin
> > >
> >
> >
> > --
> > -David
> >



-- 
-José


Re: [DISCUSS] Road to Kafka 4.0

2023-11-21 Thread Colin McCabe
On Tue, Nov 21, 2023, at 03:47, Josep Prat wrote:
> Hi Colin,
>
> I think it's great that Confluent runs KRaft clusters in production, 
> and it means that it is production ready for Confluent and its users. 
> But luckily for Kafka, the community is bigger than this (self managed 
> in the cloud or on-prem, or customers of other SaaS companies).

Hi Josep,

Confluent is not the only company using or developing KRaft. Most of the big 
organizations developing Kafka are involved. I mentioned Confluent's 
deployments because I wanted to be clear that KRaft mode is not experimental or 
new. Talking about software in production is a good way to clear up these 
misconceptions.

Indeed, KRaft mode is many years old. It started around 2020, and became 
production-ready in AK 3.3 in 2022. ZK mode was deprecated in AK 3.5, which was 
released June 2023. If we release AK 4.0 around April (or maybe a month or two 
later) then that will be almost a full year between deprecation and removal of 
ZK mode. We've talked about this a lot, in KIPs, in Apache blog posts, at 
conferences, and so forth.

> We've heard at least from 1 SaaS company, Aiven (disclaimer, it is my 
> employer) where the current feature set makes it not trivial to 
> migrate. This same issue might happen not only at Aiven but with any 
> user of Kafka who uses immutable infrastructure.

Can you discuss why you feel it is "not trivial to migrate"? From the 
discussion above, the main gap is that we should improve the documentation for 
handling failed disks.

> Another case is for 
> users that have hundreds (or more) of clusters and more than 100k nodes 
> experience node failures multiple times during a single day. In this 
> situation, not having KIP 853 makes these power users unable to join 
> the game as  introducing a new error-prone manual (or needed to 
> automate) operation is usually a huge no-go.

We have thousands of KRaft clusters in production and haven't seen these 
problems, as I described above.

best,
Colin

>
> But I hear the concerns of delaying 4.0 for another 3 to 4 months. 
> Would it help if we would aim at shortening the timeline for 3.8.0 and 
> start with the 4.0.0 a bit earlier help?
> Maybe we could work on 3.8.0 almost in parallel with 4.0.0:
> - Start with 3.8.0 release process
> - After a small time (let's say a week) create the release branch
> - Start with 4.0.0 release process as usual
> - Cherry pick KRaft related issues to 3.8.0
> - Release 3.8.0
> I suspect 4.0.0 will need a bit more time than usual to ensure the code 
> is cleaned up of deprecated classes and methods on top of the usual 
> work we have. For this reason I think there would be enough time 
> between releasing 3.8.0 and 4.0.0.
>
> What do you all think?
>
> Best,
> Josep Prat
>
> On 2023/11/20 20:03:18 Colin McCabe wrote:
>> Hi Josep,
>> 
>> I think there is some confusion here. Quorum reconfiguration is not needed 
>> for KRaft to become production ready. Confluent runs thousands of KRaft 
>> clusters without quorum reconfiguration, and has for years. While dynamic 
>> quorum reconfiguration is a nice feature, it doesn't block anything: not 
>> migration, not deployment. As best as I understand it, the use-case Aiven 
>> has isn't even reconfiguration per se, just wiping a disk. There are ways to 
>> handle this -- I discussed some earlier in the thread. I think it would be 
>> productive to continue that discussion -- especially the part around 
>> documentation and testing of these cases.
>> 
>> A lot of people have done a lot of work to get Kafka 4.0 ready. I would not 
>> want to delay that because we want an additional feature. And we will always 
>> want additional features. So I am concerned we will end up in an infinite 
>> loop of people asking for "just one more feature" before they migrate.
>> 
>> best,
>> Colin
>> 
>> 
>> On Mon, Nov 20, 2023, at 04:15, Josep Prat wrote:
>> > Hi all,
>> >
>> > I wanted to share my opinion regarding this topic. I know some 
>> > discussions happened some time ago (over a year) but I believe it's 
>> > wise to reflect and re-evaluate if those decisions are still valid.
>> > KRaft, as of Kafka 3.6.x and 3.7.x, has not yet feature parity with 
>> > Zookeeper. By dropping Zookeeper altogether before achieving such 
>> > parity, we are opening the door to leaving a chunk of Apache Kafka 
>> > users without an easy way to upgrade to 4.0.
>> > In pro of making upgrades as smooth as possible, I propose to have a 
>> > Kafka version where KIP-853 is merged and Zookeeper still is supported. 
>> > This will enable community members who can't migrate yet to KRaft to do 
>> > so in a safe way (rolling back is something goes wrong). Additionally, 
>> > this will give us more confidence on having KRaft replacing 
>> > successfully Zookeeper without any big problems by discovering and 
>> > fixing bugs or by confirming that KRaft works as expected.
>> > For this I strongly believe we should have a 3.8.x version before 

Re: [VOTE] KIP-1001; CurrentControllerId Metric

2023-11-21 Thread Colin McCabe
Thanks, everyone!

With binding +1s from:

José Armando García Sancio
Jason Gustafson
David Arthur
David Jacot

the KIP passes.

regards,
Colin

On Tue, Nov 21, 2023, at 09:25, José Armando García Sancio wrote:
> LGTM. +1 binding.
>
> On Mon, Nov 20, 2023 at 1:48 PM Jason Gustafson
>  wrote:
>>
>> The KIP makes sense. +1
>>
>> On Mon, Nov 20, 2023 at 12:37 PM David Arthur
>>  wrote:
>>
>> > Thanks Colin,
>> >
>> > +1 from me
>> >
>> > -David
>> >
>> > On Tue, Nov 14, 2023 at 3:53 PM Colin McCabe  wrote:
>> >
>> > > Hi all,
>> > >
>> > > I'd like to call a vote for KIP-1001: Add CurrentControllerId metric.
>> > >
>> > > Take a look here:
>> > > https://cwiki.apache.org/confluence/x/egyZE
>> > >
>> > > best,
>> > > Colin
>> > >
>> >
>> >
>> > --
>> > -David
>> >
>
>
>
> -- 
> -José


Re: [VOTE] KIP-968: Support single-key_multi-timestamp interactive queries (IQv2) for versioned state stores

2023-11-21 Thread Alieh Saeedi
Thanks to all for voting. So I consider KIP-968 as accepted.

Cheers,
Alieh

On Tue, Nov 21, 2023 at 5:22 PM Matthias J. Sax  wrote:

> +1 (binding)
>
> On 11/21/23 4:52 AM, Lucas Brutschy wrote:
> > Hi Alieh,
> >
> > thanks for the KIP!
> >
> > +1 (binding)
> >
> > Lucas
> >
> > On Tue, Nov 21, 2023 at 11:26 AM Alieh Saeedi
> >  wrote:
> >>
> >> Thanks, Matthias; I changed it to `ANY` which is the shortest and not
> >> misleading.
> >>
> >> Cheers,
> >> Alieh
> >>
> >> On Mon, Nov 20, 2023 at 7:42 PM Matthias J. Sax 
> wrote:
> >>
> >>> Adding an enum is a good idea!
> >>>
> >>> Wondering if `UNORDERED` is the best name? Want to avoid bike shedding,
> >>> just asking.
> >>>
> >>> We could also use `UNDEFINED` / `UNSPECIFIED` / `NONE` / `ANY` ?
> >>>
> >>> In the end, the result _might_ be ordered, we just don't guarantee any
> >>> order.
> >>>
> >>>
> >>> -Matthias
> >>>
> >>> On 11/20/23 9:17 AM, Alieh Saeedi wrote:
>  Hi all,
>  I added the public enum `ResultOrder` to the KIP which helps with
> keeping
>  three values (unordered, ascending, and descending) for the query
> >>> results.
>  Therefore the method `isAscending()` is changed to `resultOrder()`
> which
>  returns either the user specified result order or `unorderd`.
>  Cheers,
>  Alieh
> 
>  On Mon, Nov 20, 2023 at 1:40 PM Alieh Saeedi 
> >>> wrote:
> 
> > Thank you, Guozhag and Bruno, for reviewing the KIP and reading the
> >>> whole
> > discussion thread. I appreciate your help:)
> > The KIP is now corrected and updated.
> >
> > Cheers,
> > Alieh
> >
> > On Mon, Nov 20, 2023 at 10:43 AM Bruno Cadonna 
> >>> wrote:
> >
> >> Thanks Alieh,
> >>
> >> I am +1 (binding).
> >>
> >> However, although we agreed on not specifying an order of the
> results
> >>> by
> >> default, there is still the following  sentence in the KIP:
> >>
> >> "The order of the returned records is by default ascending by
> >>> timestamp.
> >> The method withDescendingTimestamps() can reverse the order. Btw,
> >> withAscendingTimestamps() method can be used for code readability
> >> purpose. "
> >>
> >> Could you please change it and also fix what Guozhang commented?
> >>
> >> Best,
> >> Bruno
> >>
> >> On 11/19/23 2:12 AM, Guozhang Wang wrote:
> >>> Thanks Alieh,
> >>>
> >>> I read through the wiki page and the DISCUSS thread, all LGTM
> except a
> >>> minor thing in javadoc:
> >>>
> >>> "The query returns the records with a global ascending order of
> keys.
> >>> The records with the same key are ordered based on their insertion
> >>> timestamp in ascending order. Both the global and partial ordering
> are
> >>> modifiable with the corresponding methods defined for the class."
> >>>
> >>> Since this KIP is only for a single key, there's no key ordering
> but
> >>> only timestamp ordering right? Maybe the javadoc can be updated
> >>> accordingly.
> >>>
> >>> Otherwise, LGTM.
> >>>
> >>> On Fri, Nov 17, 2023 at 2:36 AM Alieh Saeedi
> >>>  wrote:
> 
>  Hi all,
>  Following my recent message in the discussion thread, I am opening
> >>> the
>  voting for KIP-968. Thanks for your votes in advance.
> 
>  Cheers,
>  Alieh
> >>
> >
> 
> >>>
>


Re: [DISCUSS] KIP-896: Remove old client protocol API versions in Kafka 4.0

2023-11-21 Thread Ismael Juma
Hi Jose,

I updated the KIP to include a new metric for deprecated request api
versions and also a new attribute in the request log to make it easy to
find such entries.
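
To make that concrete, here is a minimal sketch of scanning the per-version
RequestsPerSec mbeans over JMX to spot usage of old api versions ahead of 4.0.
The JMX port, the Produce filter, and reading the "Count" attribute are
illustrative assumptions, not part of the KIP:

import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RequestVersionScan {
    public static void main(String[] args) throws Exception {
        // Broker JMX endpoint; port 9999 is only an example.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // One mbean exists per request type and api version; Produce is just an example.
            ObjectName pattern = new ObjectName(
                    "kafka.network:type=RequestMetrics,name=RequestsPerSec,request=Produce,version=*");
            Set<ObjectName> names = conn.queryNames(pattern, null);
            for (ObjectName name : names) {
                Object count = conn.getAttribute(name, "Count");
                System.out.println("version " + name.getKeyProperty("version") + " -> " + count);
            }
        }
    }
}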

Thanks,
Ismael

On Thu, Jan 12, 2023 at 1:03 AM Ismael Juma  wrote:

> Hi Jose,
>
> I think it's reasonable to add more user-friendly metrics as you
> described. I'll update the KIP soon with that. I'll try to define them in a
> way where they track deprecated protocols for the next major release. That
> way, they can be useful even after AK 4.0 is released.
>
> Ismael
>
> On Wed, Jan 11, 2023 at 12:34 PM José Armando García Sancio
>  wrote:
>
>> Thanks Ismael.
>>
>> > The following metrics are used to determine both questions:
>> > >
>> > >- Client name and version:
>> > >
>> kafka.server:clientSoftwareName=(client-software-name),clientSoftwareVersion=(client-software-version),listener=(listener),networkProcessor=(processor-index),type=(type)
>> > >- Request name and version:
>> > >
>> kafka.network:type=RequestMetrics,name=RequestsPerSec,request=(api-name),version=(api-version)}
>> > >
>> > >
>> > Are you suggesting that this is too complicated and hence we should add
>> a
>> > metric that tracks AK 4.0 support explicitly?
>>
>> Correct. It doesn't look trivial for the users to implement this check
>> against the RequestMetrics. I was wondering if it is worth it for
>> Kafka to implement this for them and expose a simple metric that they
>> can check.
>>
>> --
>> -José
>>
>


[VOTE] KIP-896: Remove old client protocol API versions in Kafka 4.0

2023-11-21 Thread Ismael Juma
Hi all,

I would like to start a vote on KIP-896. Please take a look and let us know
what you think.

Even though most of the changes in this KIP will be done for Apache Kafka
4.0, I would like to introduce a new metric and new request log attribute
in Apache 3.7 to help users identify usage of deprecated protocol api
versions.

Link:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-896%3A+Remove+old+client+protocol+API+versions+in+Kafka+4.0

Thanks,
Ismael


Re: [DISCUSS] KIP-896: Remove old client protocol API versions in Kafka 4.0

2023-11-21 Thread Ismael Juma
I started a vote thread for this since I addressed the comments so far and
it doesn't seem like there were any major concerns.

Ismael

On Tue, Jan 3, 2023 at 8:17 AM Ismael Juma  wrote:

> Hi all,
>
> I would like to start a discussion regarding the removal of very old
> client protocol API versions in Apache Kafka 4.0 to improve maintainability
> & supportability of Kafka. Please take a look at the proposal:
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-896%3A+Remove+old+client+protocol+API+versions+in+Kafka+4.0
>
> Ismael
>


[DISCUSS] KIP-1009: Add Broker-level Throttle Configurations

2023-11-21 Thread Ria Pradeep (BLOOMBERG/ 919 3RD A)
Hi All,

I'd like to start a discussion on KIP-1009: Add Broker-level Throttle 
Configurations.

Re: [VOTE] KIP-896: Remove old client protocol API versions in Kafka 4.0

2023-11-21 Thread Colin McCabe
Hi Ismael,

Can we state somewhere that the message.format.version configuration will be 
gone in 4.0? We only will support one message format version (for now, at 
least). If we do want more versions later, I don't think we'll want to 
configure them via a static config.

best,
Colin


On Tue, Nov 21, 2023, at 12:06, Ismael Juma wrote:
> Hi all,
>
> I would like to start a vote on KIP-896. Please take a look and let us know
> what you think.
>
> Even though most of the changes in this KIP will be done for Apache Kafka
> 4.0, I would like to introduce a new metric and new request log attribute
> in Apache 3.7 to help users identify usage of deprecated protocol api
> versions.
>
> Link:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-896%3A+Remove+old+client+protocol+API+versions+in+Kafka+4.0
>
> Thanks,
> Ismael


Re:[DISCUSS] KIP-1009: Add Broker-level Throttle Configurations

2023-11-21 Thread Ria Pradeep (BLOOMBERG/ 919 3RD A)
Hi All,

(I sent the previous email prematurely, apologies)

I'd like to start a discussion on KIP-1009: Add Broker-level Throttle 
Configurations - 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-1009%3A+Add+Broker-level+Throttle+Configurations

At a high level, this KIP adds dynamic broker level configurations to throttle 
out of sync replication. Currently, replication throttles can be configured at 
a topic level, but not for the broker as a whole.
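
For context, a minimal sketch of how the existing throttles are applied today:
the *.replication.throttled.rate properties are already dynamic broker configs,
but they only take effect for replicas enumerated in the topic-level
*.replication.throttled.replicas configs. Broker id, topic name, and rates below
are illustrative:

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class ReplicationThrottleExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "0");
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
            // The broker-level rate (bytes/sec) only applies to replicas that are
            // listed in the topic-level throttled-replicas configs below.
            Map<ConfigResource, Collection<AlterConfigOp>> updates = Map.of(
                    broker, List.of(
                            new AlterConfigOp(new ConfigEntry("leader.replication.throttled.rate", "10485760"),
                                    AlterConfigOp.OpType.SET),
                            new AlterConfigOp(new ConfigEntry("follower.replication.throttled.rate", "10485760"),
                                    AlterConfigOp.OpType.SET)),
                    topic, List.of(
                            new AlterConfigOp(new ConfigEntry("leader.replication.throttled.replicas", "*"),
                                    AlterConfigOp.OpType.SET)));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}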

I’m looking forward to hearing any suggestions or thoughts on this KIP!

Thanks,
Ria


From: dev@kafka.apache.org At: 11/21/23 15:17:15 UTC-5:00To:  
dev@kafka.apache.org
Subject: [DISCUSS] KIP-1009: Add Broker-level Throttle Configurations

Hi All,

I'd like to start a discussion on KIP-1009: Add Broker-level Throttle 
Configurations.



Re: [VOTE] KIP-896: Remove old client protocol API versions in Kafka 4.0

2023-11-21 Thread Ismael Juma
Hi Colin,

That change was proposed and approved via KIP-724:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-724%3A+Drop+support+for+message+formats+v0+and+v1

Ismael

On Tue, Nov 21, 2023, 12:21 PM Colin McCabe  wrote:

> Hi Ismael,
>
> Can we state somewhere that the message.format.version configuration will
> be gone in 4.0? We only will support one message format version (for now,
> at least). If we do want more versions later, I don't think we'll want to
> configure them via a static config.
>
> best,
> Colin
>
>
> On Tue, Nov 21, 2023, at 12:06, Ismael Juma wrote:
> > Hi all,
> >
> > I would like to start a vote on KIP-896. Please take a look and let us
> know
> > what you think.
> >
> > Even though most of the changes in this KIP will be done for Apache Kafka
> > 4.0, I would like to introduce a new metric and new request log attribute
> > in Apache 3.7 to help users identify usage of deprecated protocol api
> > versions.
> >
> > Link:
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-896%3A+Remove+old+client+protocol+API+versions+in+Kafka+4.0
> >
> > Thanks,
> > Ismael
>


Re: [DISCUSS] Should we continue to merge without a green build? No!

2023-11-21 Thread Sophie Blee-Goldman
So...where did we land on this?

To take a few steps back and make sure we're all talking about the same
thing, I want to mention that I myself was responsible for merging a PR
that broke the build a few weeks ago. There was a warning that only
appeared in some of the versions, but when checking on the results I
immediately navigated to the "Tests" page before letting the main build
page load. I just assumed that if there were test failures that must mean
the build itself was fine, and with this tunnel-vision I missed that a
subset of JDK builds had failed.

I'm mentioning this now not to make excuses, but because I really hope that
it goes without saying that this was an honest (though sloppy) mistake, and
that no one genuinely believes it is acceptable to merge any PR that causes
any of the builds to fail to compile. Yet multiple people in this thread so
far have voiced support for "gating merges on the successful completion of
all parts of the build except tests". Just to be totally clear, I really
don't think that was ever in question -- though it certainly doesn't hurt
to remind everyone.

So, this thread is not about whether or not to merge with failing
*builds*, but it's
whether it should be acceptable to merge with failing *tests*. It seems
like we're in agreement for the most part that the current system isn't
great, but I don't think we can (or rather, should) start enforcing the new
rule to only merge fully-green builds until we can actually get to a
reasonable percentage of green builds. Do we have any stats on how often
builds are failing right now due to flaky tests? Just speaking from
personal experience, it's been a while since I've seen a truly green build
with all tests passing.

Lastly, just to play devil's advocate for a moment, before we commit to
this I think we need to consider how such a policy would impact the
contributor experience. It's incredibly disheartening to go through all the
work of submitting a PR and getting it reviewed, only to be blocked from merging
at the last minute due to test failures completely outside that
contributor's control. We already struggle just getting some folks, new
contributors in particular, all the way through the review process without
abandoning their PRs. I do think we've been doing a better job lately, but
it's a cycle that ebbs and flows, and most community PRs still take an
admirable degree of patience and persistence just to get enough reviews to
reach the point of merging. So I just think we need to be careful not to
make this situation even worse by having to wait for a green build. I'm
just worried about our ability to stay on top of disabling tests,
especially if we need to wait for someone with enough context to make a
judgement call. Can we really rely on everyone to drop everything at any
time to check on a failing test? At the same time I would be hesitant to be
overly aggressive about reverting/disabling tests without having someone
who understands the context take a look.

That said, I do think it's worth trying this out as an experiment, as long
as we can be clear to frustrated contributors that this isn't necessarily
the new policy from here on out if it isn't going well.

On Thu, Nov 16, 2023 at 3:32 AM Igor Soarez  wrote:

> Hi all,
>
> I think at least one of those is my fault, apologies.
> I'll try to make sure all my tests are passing from now on.
>
> It doesn't help that GitHub always shows that the tests have failed,
> even when they have not. I suspect this is because Jenkins always
> marks the builds as unstable, even when all tests pass, because
> the "Archive JUnit-formatted test results" step seems to persistently
> fail with "[Checks API] No suitable checks publisher found.".
> e.g.
> https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14770/1/pipeline/
>
> Can we get rid of that persistent failure and actually mark successful
> test runs as green?
>
> --
> Igor
>


Re: [DISCUSS] Should we continue to merge without a green build? No!

2023-11-21 Thread Sophie Blee-Goldman
In the interest of moving things forward, here is what we would need (in my
opinion) to start enforcing this:

   1. Get overall build failure percentage under a certain threshold
  1. What is an acceptable number here?
  2. How do we achieve this: wait until all of them are fixed, disable
  everything that's flaky right away, etc
   2. Come up with concrete policy rules so there's no confusion. I think
   we need to agree on answers for these questions at least:
   1. What happens if a new PR introduces a new test that is revealed to be
  flaky?
  2. What happens if a new PR makes an old test become flaky?
  3. How long do we have to take action in the above cases?
  4. How do we track failures?
  5. How often does a test need to fail to be considered flaky enough
  to take action?

Here's my take on these questions, but would love to hear from others:

1.1) Imo the failure rate has to be under 50% at the very highest. At 50 we
still have half the builds failing, but 75% would pass after just one
retry, and 87.5% after two retries. Is that acceptable? Maybe we should aim
to kick things off with a higher success rate to give ourselves some wiggle
room over time, and have 50% be the absolute maximum failure rate -- if it
ever gets beyond that we trigger an emergency response (whether that be
blocking all feature work until the test failures are addressed, ending
this policy, etc)
1.2) I think we'd probably all agree that there's no way we'll be able to
triage all of the currently flaky tests in a reasonable time frame, but I'm
also wary of just disabling everything and forgetting them. Should we
consider making all of the currently-disabled tests a 3.7 release blocker?
Should we wait until after 3.7 to disable them and implement this policy to
give us a longer time to address them? Will we just procrastinate until the
next release deadline anyways? Many open questions here

2.1) We can either revert the entire PR or just disable the specific
test(s) that are failing. Personally I'm in favor of giving a short window
(see point #3) for the author to attempt a fix or else disable the specific
failing test, otherwise we revert the entire commit. This should guarantee
that tests aren't just introduced, disabled, and never looked at again. My
main concern here is that sometimes reverting a commit can mean resolving
merge conflicts that not everyone will have the context to do properly.
This is just another reason to keep the window short.
2.2) Again I think we should allow a (short) window of time for someone
with context to triage and take action, but after that, we need to either
revert the PR or disable the test. Neither of those feels like a good
option, since new changes can often induce timeout-related failures in my
experience, which are solely a problem with the test and not with the
change. Maybe if the test is failing very frequently/most of the time, then
we should revert the change, but if it becomes only occasionally flaky,
then we should disable the test, file a ticket, and mark it a blocker until
it can be triaged (the triaging can of course include demoting it from a
blocker if someone with context determines the test itself is at fault and
offers little value).
2.3) I think any longer than a day (or over the weekend) starts to become
an issue. Of course, we can/will be more lenient with tests that fail less
often, especially since it might just take a while to notice
2.4) Presumably the best way is to file tickets for every failure and
comment on that ticket for each subsequent failure. I used to do this quite
religiously, but I have to admit, it was actually a non-trivial timesuck.
If we can get to a low enough level of failure, however, this should be
more reasonable. We just need to trust everyone to pay attention and follow
up on failures they see. And will probably need to come up with a system to
avoid overcounting by different reviewers of the same PR/build
2.5) This one is actually the most tricky in my opinion. I think everyone
would agree that if a test is failing on all or most of the JDK runs within
a single build, it is problematic. And probably most of us would agree that
if it fails even once per build on almost all PRs, we need to take action.
On the other hand, a single test that fails once every 50 builds might feel
acceptable, but if we have 100 such tests, suddenly it's extremely
difficult to get a clean build. So what's the threshold for action?

So...any thoughts on how to implement this policy? And when to actually
start enforcing it?
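
For concreteness, the "disable the specific test" action in 2.1/2.2 would
usually amount to something like the following JUnit 5 sketch (the ticket
number is a placeholder, and the exact convention is up for discussion):

import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test;

class SomeIntegrationTest {

    // Disabled rather than deleted, with a blocker JIRA filed so the test gap is tracked.
    @Disabled("KAFKA-XXXXX: flaky, re-enable once the root cause is fixed")
    @Test
    void shouldDoTheFlakyThing() {
        // original test body stays in place
    }
}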

On Tue, Nov 21, 2023 at 1:18 PM Sophie Blee-Goldman 
wrote:

> So...where did we land on this?
>
> To take a few steps back and make sure we're all talking about the same
> thing, I want to mention that I myself was responsible for merging a PR
> that broke the build a few weeks ago. There was a warning that only
> appeared in some of the versions, but when checking on the results I
> immediately navigated 

Jenkins build is still unstable: Kafka » Kafka Branch Builder » trunk #2405

2023-11-21 Thread Apache Jenkins Server
See 




Re: [DISCUSS] Should we continue to merge without a green build? No!

2023-11-21 Thread Sophie Blee-Goldman
For some concrete data, here are the stats for the latest build on two
community PRs I am currently reviewing:

https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14648/16/tests
- 18 unrelated test failures
- 13 unique tests
- only 1 out of 4 JDK builds were green with all tests passing

https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14735/4/tests
- 44(!) unrelated test failures
- not even going to count up the unique tests because there were too many
- 0 out of 4 JDK builds were green
- this particular build seemed to have a number of infra/timeout issues, so
it may have exacerbated the flakiness beyond the "usual", although as
others have noted, unstable infra is not uncommon

My point is, we clearly have a long way to go if we want to start enforcing
this policy and have any hope of merging any PRs and not driving away our
community of contributors.

This isn't meant to discourage anyone, actually the opposite: if we want to
start gating PRs on passing builds, we need to start tackling flaky tests
now!

On Tue, Nov 21, 2023 at 1:19 PM Sophie Blee-Goldman 
wrote:

> In the interest of moving things forward, here is what we would need (in
> my opinion) to start enforcing this:
>
>1. Get overall build failure percentage under a certain threshold
>   1. What is an acceptable number here?
>   2. How do we achieve this: wait until all of them are fixed,
>   disable everything that's flaky right away, etc
>2. Come up with concrete policy rules so there's no confusion. I think
>we need to agree on answers for these questions at least:
>1. What happens if a new PR introduces a new test that is revealed to
>   be flaky?
>   2. What happens if a new PR makes an old test become flaky?
>   3. How long do we have to take action in the above cases?
>   4. How do we track failures?
>   5. How often does a test need to fail to be considered flaky enough
>   to take action?
>
> Here's my take on these questions, but would love to hear from others:
>
> 1.1) Imo the failure rate has to be under 50% at the very highest. At 50
> we still have half the builds failing, but 75% would pass after just one
> retry, and 87.5% after two retries. Is that acceptable? Maybe we should aim
> to kick things off with a higher success rate to give ourselves some wiggle
> room over time, and have 50% be the absolute maximum failure rate -- if it
> ever gets beyond that we trigger an emergency response (whether that be
> blocking all feature work until the test failures are addressed, ending
> this policy, etc)
> 1.2) I think we'd probably all agree that there's no way we'll be able to
> triage all of the currently flaky tests in a reasonable time frame, but I'm
> also wary of just disabling everything and forgetting them. Should we
> consider making all of the currently-disabled tests a 3.7 release blocker?
> Should we wait until after 3.7 to disable them and implement this policy to
> give us a longer time to address them? Will we just procrastinate until the
> next release deadline anyways? Many open questions here
>
> 2.1) We can either revert the entire PR or just disable the specific
> test(s) that are failing. Personally I'm in favor of giving a short window
> (see point #3) for the author to attempt a fix or else disable the specific
> failing test, otherwise we revert the entire commit. This should guarantee
> that tests aren't just introduced, disabled, and never looked at again. My
> main concern here is that sometimes reverting a commit can mean resolving
> merge conflicts that not everyone will have the context to do properly.
> This is just another reason to keep the window short.
> 2.2) Again I think we should allow a (short) window of time for someone
> with context to triage and take action, but after that, we need to either
> revert the PR or disable the test. Neither of those feels like a good
> option, since new changes can often induce timeout-related failures in my
> experience, which are solely a problem with the test and not with the
> change. Maybe if the test is failing very frequently/most of the time, then
> we should revert the change, but if it becomes only occasionally flaky,
> then we should disable the test, file a ticket, and mark it a blocker until
> it can be triaged (the triaging can of course include demoting it from a
> blocker if someone with context determines the test itself is at fault and
> offers little value).
> 2.3) I think any longer than a day (or over the weekend) starts to become
> an issue. Of course, we can/will be more lenient with tests that fail less
> often, especially since it might just take a while to notice
> 2.4) Presumably the best way is to file tickets for every failure and
> comment on that ticket for each subsequent failure. I used to do this quite
> religiously, but I have to admit, it was actually a non-trivial timesuck.
> If we can get to a low enough leve

[jira] [Resolved] (KAFKA-15215) The default.dsl.store config is not compatible with custom state stores

2023-11-21 Thread Almog Gavra (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Almog Gavra resolved KAFKA-15215.
-
Fix Version/s: 3.7.0
   Resolution: Fixed

> The default.dsl.store config is not compatible with custom state stores
> ---
>
> Key: KAFKA-15215
> URL: https://issues.apache.org/jira/browse/KAFKA-15215
> Project: Kafka
>  Issue Type: New Feature
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Assignee: Almog Gavra
>Priority: Major
>  Labels: needs-kip
> Fix For: 3.7.0
>
>
> Sort of a bug, sort of a new/missing feature. When we added the long-awaited 
> default.dsl.store config, it was decided to scope the initial KIP to just the 
> two out-of-the-box state stores types offered by Streams, rocksdb and 
> in-memory. The reason being that this would address a large number of the 
> relevant use cases, and could always be followed up with another KIP for 
> custom state stores if/when the demand arose.
> Of course, since rocksdb is the default anyways, the only beneficiaries of 
> this KIP right now are the people who specifically want only in-memory stores 
> – yet custom state stores users are probably by far the ones with the 
> greatest need for an easier way to configure the store type across an entire 
> application. And unfortunately, because the config currently relies on enum 
> definitions for the known OOTB store types, there's not really any way to 
> extend this feature as it is to work with custom implementations.
> I think this is a great feature, which is why I hope to see it extended to 
> the broader user base. Most likely we'll want to introduce a new config for 
> this, though whether it replaces the old default.dsl.store config or 
> complements it will have to be decided during the KIP discussion



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15774) Respect default.dsl.store Configuration Without Passing it to StreamsBuilder

2023-11-21 Thread Almog Gavra (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Almog Gavra resolved KAFKA-15774.
-
Fix Version/s: 3.7.0
   Resolution: Fixed

Note that, for backwards compatibility, we decided not to have default.dsl.store 
work if you only pass it in to the main KafkaStreams constructor. Instead you 
should use the new dsl.store.suppliers configuration.
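
As a rough sketch of the new path (the config constant and supplier class names
below follow KIP-954 and should be treated as assumptions rather than the final
API):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.state.BuiltInDslStoreSuppliers;

public class InMemoryDslStoresExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Replaces default.dsl.store: picks the in-memory suppliers for all DSL stores.
        props.put(StreamsConfig.DSL_STORE_SUPPLIERS_CLASS_CONFIG,
                BuiltInDslStoreSuppliers.InMemoryDslStoreSuppliers.class);
        // props is then passed to new KafkaStreams(topology, props) as usual.
    }
}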

> Respect default.dsl.store Configuration Without Passing it to StreamsBuilder
> 
>
> Key: KAFKA-15774
> URL: https://issues.apache.org/jira/browse/KAFKA-15774
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Almog Gavra
>Assignee: Almog Gavra
>Priority: Major
> Fix For: 3.7.0
>
>
> Currently if you only configure `default.dsl.store` as `in_memory` in your 
> `StreamsConfig` it will silently be ignored unless it's also passed into 
> `StreamsBuilder#new(TopologyConfig)`. We should improve this behavior to 
> properly respect it.
> This will become more important with the introduction of KIP-954.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] KIP-896: Remove old client protocol API versions in Kafka 4.0

2023-11-21 Thread Colin McCabe
Ah. I forget that KIP-724 not only deprecated, but proposed a removal in 4.0. 
Great.

+1 (binding) for KIP-896

best,
Colin

On Tue, Nov 21, 2023, at 12:36, Ismael Juma wrote:
> Hi Colin,
>
> That change was proposed and approved via KIP-724:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-724%3A+Drop+support+for+message+formats+v0+and+v1
>
> Ismael
>
> On Tue, Nov 21, 2023, 12:21 PM Colin McCabe  wrote:
>
>> Hi Ismael,
>>
>> Can we state somewhere that the message.format.version configuration will
>> be gone in 4.0? We only will support one message format version (for now,
>> at least). If we do want more versions later, I don't think we'll want to
>> configure them via a static config.
>>
>> best,
>> Colin
>>
>>
>> On Tue, Nov 21, 2023, at 12:06, Ismael Juma wrote:
>> > Hi all,
>> >
>> > I would like to start a vote on KIP-896. Please take a look and let us
>> know
>> > what you think.
>> >
>> > Even though most of the changes in this KIP will be done for Apache Kafka
>> > 4.0, I would like to introduce a new metric and new request log attribute
>> > in Apache 3.7 to help users identify usage of deprecated protocol api
>> > versions.
>> >
>> > Link:
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-896%3A+Remove+old+client+protocol+API+versions+in+Kafka+4.0
>> >
>> > Thanks,
>> > Ismael
>>


Re: [VOTE] KIP-896: Remove old client protocol API versions in Kafka 4.0

2023-11-21 Thread Jun Rao
Hi, Ismael,

Thanks for the KIP. +1

It would be useful to clarify in the KIP that the new metric and new
request log attribute will be added in Apache 3.7.

Jun

On Tue, Nov 21, 2023 at 1:57 PM Colin McCabe  wrote:

> Ah. I forget that KIP-724 not only deprecated, but proposed a removal in
> 4.0. Great.
>
> +1 (binding) for KIP-896
>
> best,
> Colin
>
> On Tue, Nov 21, 2023, at 12:36, Ismael Juma wrote:
> > Hi Colin,
> >
> > That change was proposed and approved via KIP-724:
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-724%3A+Drop+support+for+message+formats+v0+and+v1
> >
> > Ismael
> >
> > On Tue, Nov 21, 2023, 12:21 PM Colin McCabe  wrote:
> >
> >> Hi Ismael,
> >>
> >> Can we state somewhere that the message.format.version configuration
> will
> >> be gone in 4.0? We only will support one message format version (for
> now,
> >> at least). If we do want more versions later, I don't think we'll want
> to
> >> configure them via a static config.
> >>
> >> best,
> >> Colin
> >>
> >>
> >> On Tue, Nov 21, 2023, at 12:06, Ismael Juma wrote:
> >> > Hi all,
> >> >
> >> > I would like to start a vote on KIP-896. Please take a look and let us
> >> know
> >> > what you think.
> >> >
> >> > Even though most of the changes in this KIP will be done for Apache
> Kafka
> >> > 4.0, I would like to introduce a new metric and new request log
> attribute
> >> > in Apache 3.7 to help users identify usage of deprecated protocol api
> >> > versions.
> >> >
> >> > Link:
> >> >
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-896%3A+Remove+old+client+protocol+API+versions+in+Kafka+4.0
> >> >
> >> > Thanks,
> >> > Ismael
> >>
>


[jira] [Created] (KAFKA-15873) Improve the performance of the DescribeTopicPartitions API

2023-11-21 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15873:
--

 Summary: Improve the performance of the DescribeTopicPartitions API
 Key: KAFKA-15873
 URL: https://issues.apache.org/jira/browse/KAFKA-15873
 Project: Kafka
  Issue Type: Sub-task
Reporter: Calvin Liu


The current API involves sorting, copying, and checking topics that will fall outside 
the response limit. We should think about how to improve the performance of 
this API, as it will be a main API for querying partitions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] KIP-896: Remove old client protocol API versions in Kafka 4.0

2023-11-21 Thread José Armando García Sancio
Thanks. LGTM. +1.

On Tue, Nov 21, 2023 at 2:54 PM Jun Rao  wrote:
>
> Hi, Ismael,
>
> Thanks for the KIP. +1
>
> It would be useful to clarify in the KIP that the new metric and new
> request log attribute will be added in Apache 3.7.
>
> Jun
>
> On Tue, Nov 21, 2023 at 1:57 PM Colin McCabe  wrote:
>
> > Ah. I forget that KIP-724 not only deprecated, but proposed a removal in
> > 4.0. Great.
> >
> > +1 (binding) for KIP-896
> >
> > best,
> > Colin
> >
> > On Tue, Nov 21, 2023, at 12:36, Ismael Juma wrote:
> > > Hi Colin,
> > >
> > > That change was proposed and approved via KIP-724:
> > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-724%3A+Drop+support+for+message+formats+v0+and+v1
> > >
> > > Ismael
> > >
> > > On Tue, Nov 21, 2023, 12:21 PM Colin McCabe  wrote:
> > >
> > >> Hi Ismael,
> > >>
> > >> Can we state somewhere that the message.format.version configuration
> > will
> > >> be gone in 4.0? We only will support one message format version (for
> > now,
> > >> at least). If we do want more versions later, I don't think we'll want
> > to
> > >> configure them via a static config.
> > >>
> > >> best,
> > >> Colin
> > >>
> > >>
> > >> On Tue, Nov 21, 2023, at 12:06, Ismael Juma wrote:
> > >> > Hi all,
> > >> >
> > >> > I would like to start a vote on KIP-896. Please take a look and let us
> > >> know
> > >> > what you think.
> > >> >
> > >> > Even though most of the changes in this KIP will be done for Apache
> > Kafka
> > >> > 4.0, I would like to introduce a new metric and new request log
> > attribute
> > >> > in Apache 3.7 to help users identify usage of deprecated protocol api
> > >> > versions.
> > >> >
> > >> > Link:
> > >> >
> > >>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-896%3A+Remove+old+client+protocol+API+versions+in+Kafka+4.0
> > >> >
> > >> > Thanks,
> > >> > Ismael
> > >>
> >



-- 
-José


Jenkins build is still unstable: Kafka » Kafka Branch Builder » trunk #2406

2023-11-21 Thread Apache Jenkins Server
See 




Re: [DISCUSS] KIP-974 Docker Image for GraalVM based Native Kafka Broker

2023-11-21 Thread Justine Olshan
Hey -- just catching up here, since I saw the vote thread. I had 2
questions that I'm not sure got answered from the previous discussion.

1. Can we update the KIP to include the name of the other image so if
someone stumbles across this KIP they know the name of the other one?
2. Did we cover what "experimental" means here? I think Ismael asked
> Can we talk a bit more about the compatibility guarantees while this
image is still experimental?
I took this to mean: should we be able to upgrade to or from this image, or
can clusters running on it only stay on it? Or are there no
guarantees about the upgrade/downgrade story?

Thanks,
Justine

On Sun, Nov 19, 2023 at 7:54 PM Krishna Agarwal <
krishna0608agar...@gmail.com> wrote:

> Hi,
> Thanks for the insightful feedback on this KIP.
> As there are no ongoing discussions, I'm considering moving into the voting
> process.
> Your continued input is greatly appreciated!
>
> Regards,
> Krishna
>
> On Fri, Sep 8, 2023 at 12:47 PM Krishna Agarwal <
> krishna0608agar...@gmail.com> wrote:
>
> > Hi,
> > I want to submit a KIP to deliver an experimental Apache Kafka docker
> > image.
> > The proposed docker image can launch brokers with sub-second startup time
> > and minimal memory footprint by leveraging a GraalVM based native Kafka
> > binary.
> >
> > KIP-974: Docker Image for GraalVM based Native Kafka Broker
> > <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-974%3A+Docker+Image+for+GraalVM+based+Native+Kafka+Broker
> >
> >
> > Regards,
> > Krishna
> >
>


[jira] [Created] (KAFKA-15874) Add metric and request log attribute for deprecated request api versions

2023-11-21 Thread Ismael Juma (Jira)
Ismael Juma created KAFKA-15874:
---

 Summary: Add metric and request log attribute for deprecated 
request api versions
 Key: KAFKA-15874
 URL: https://issues.apache.org/jira/browse/KAFKA-15874
 Project: Kafka
  Issue Type: Sub-task
Reporter: Ismael Juma
Assignee: Ismael Juma
 Fix For: 3.7.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] KIP-896: Remove old client protocol API versions in Kafka 4.0

2023-11-21 Thread Ismael Juma
Thanks Jun. I updated the KIP with this information and also linked to a
JIRA that captures the items required for 3.7.

Ismael

On Tue, Nov 21, 2023 at 2:53 PM Jun Rao  wrote:

> Hi, Ismael,
>
> Thanks for the KIP. +1
>
> It would be useful to clarify in the KIP that the new metric and new
> request log attribute will be added in Apache 3.7.
>
> Jun
>
> On Tue, Nov 21, 2023 at 1:57 PM Colin McCabe  wrote:
>
> > Ah. I forget that KIP-724 not only deprecated, but proposed a removal in
> > 4.0. Great.
> >
> > +1 (binding) for KIP-896
> >
> > best,
> > Colin
> >
> > On Tue, Nov 21, 2023, at 12:36, Ismael Juma wrote:
> > > Hi Colin,
> > >
> > > That change was proposed and approved via KIP-724:
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-724%3A+Drop+support+for+message+formats+v0+and+v1
> > >
> > > Ismael
> > >
> > > On Tue, Nov 21, 2023, 12:21 PM Colin McCabe 
> wrote:
> > >
> > >> Hi Ismael,
> > >>
> > >> Can we state somewhere that the message.format.version configuration
> > will
> > >> be gone in 4.0? We only will support one message format version (for
> > now,
> > >> at least). If we do want more versions later, I don't think we'll want
> > to
> > >> configure them via a static config.
> > >>
> > >> best,
> > >> Colin
> > >>
> > >>
> > >> On Tue, Nov 21, 2023, at 12:06, Ismael Juma wrote:
> > >> > Hi all,
> > >> >
> > >> > I would like to start a vote on KIP-896. Please take a look and let
> us
> > >> know
> > >> > what you think.
> > >> >
> > >> > Even though most of the changes in this KIP will be done for Apache
> > Kafka
> > >> > 4.0, I would like to introduce a new metric and new request log
> > attribute
> > >> > in Apache 3.7 to help users identify usage of deprecated protocol
> api
> > >> > versions.
> > >> >
> > >> > Link:
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-896%3A+Remove+old+client+protocol+API+versions+in+Kafka+4.0
> > >> >
> > >> > Thanks,
> > >> > Ismael
> > >>
> >
>


Re: [DISCUSS] Should we continue to merge without a green build? No!

2023-11-21 Thread Matthias J. Sax

Thanks Sophie. Overall I agree with you.

I think 50% is too high as a general rule, and I believe something like 
30% might be more appropriate (going lower might become too aggressive, 
given the infrastructure at hand).


The difficult part about a policy like this is that we don't really 
have statistics about it, so it will be up to the reviewers to keep 
monitoring and raising an "alert" (on the dev mailing list?) if we see 
too many failing builds.


How to achieve this? Personally, I think it would be best to agree on a 
"bug bash sprint". Not sure if folks are willing to sign up for this? 
The main question might be "when"? We recently discussed test stability in 
our team, and don't see capacity now, but want to invest more in Q1. 
(Btw: this issue goes beyond the build and also affects system tests.)



> How long do we have to take action in the above cases?


You propose one day, but that's tricky? In the end, if a test is flaky, 
we won't know right away? Would we?



> How do we track failures?


I also filed tickets and added comments to them in the past when tests 
re-failed. While labor intensive, it seems it worked best in the past. 
Justine mentioned Gradle Enterprise (I did not look into it yet, but 
maybe it can help to reduce manual labor)?



> How often does a test need to fail to be considered flaky enough to take action?


Guess it's a judgement call, and I don't have a strong opinion. But I 
agree that we might want to write down a rule. But again, a rule only 
makes sense if we have data. If reviewers don't pay attention and don't 
comment on tickets so we can count how often a test fails, any number we 
put down won't be helpful.



In the end, to me it boils down to the willingness of all of us to 
tackle it, and to _make time_ to address flaky tests. On our side (ie, 
KS teams at Confluent), we want to put more time aside for this in 
quarterly planning, but in the end, without reliable data it's hard to 
know which tests to spend time on for the biggest bang for the buck. 
Thus, to me the second cornerstone is to put the manual labor into 
tracking the frequency of flaky tests, which is a group effort.



-Matthias


On 11/21/23 1:33 PM, Sophie Blee-Goldman wrote:

For some concrete data, here are the stats for the latest build on two
community PRs I am currently reviewing:

https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14648/16/tests
- 18 unrelated test failures
- 13 unique tests
- only 1 out of 4 JDK builds were green with all tests passing

https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14735/4/tests
- 44(!) unrelated test failures
- not even going to count up the unique tests because there were too many
- 0 out of 4 JDK builds were green
- this particular build seemed to have a number of infra/timeout issues, so
it may have exacerbated the flakiness beyond the "usual", although as
others have noted, unstable infra is not uncommon

My point is, we clearly have a long way to go if we want to start enforcing
this policy and have any hope of merging any PRs and not driving away our
community of contributors.

This isn't meant to discourage anyone, actually the opposite: if we want to
start gating PRs on passing builds, we need to start tackling flaky tests
now!

On Tue, Nov 21, 2023 at 1:19 PM Sophie Blee-Goldman 
wrote:


In the interest of moving things forward, here is what we would need (in
my opinion) to start enforcing this:

1. Get overall build failure percentage under a certain threshold
   1. What is an acceptable number here?
   2. How do we achieve this: wait until all of them are fixed,
   disable everything that's flaky right away, etc
2. Come up with concrete policy rules so there's no confusion. I think
we need to agree on answers for these questions at least:
1. What happens if a new PR introduces a new test that is revealed to
   be flaky?
   2. What happens if a new PR makes an old test become flaky?
   3. How long do we have to take action in the above cases?
   4. How do we track failures?
   5. How often does a test need to fail to be considered flaky enough
   to take action?

Here's my take on these questions, but would love to hear from others:

1.1) Imo the failure rate has to be under 50% at the very highest. At 50
we still have half the builds failing, but 75% would pass after just one
retry, and 87.5% after two retries. Is that acceptable? Maybe we should aim
to kick things off with a higher success rate to give ourselves some wiggle
room over time, and have 50% be the absolute maximum failure rate -- if it
ever gets beyond that we trigger an emergency response (whether that be
blocking all feature work until the test failures are addressed, ending
this policy, etc)
1.2) I think we'd probably all agree that there's no way we'll be able to
triage all of the currently flaky tests in a reasonable time frame, but I'm
also wary of 

Re: [DISCUSS] Should we continue to merge without a green build? No!

2023-11-21 Thread Ismael Juma
Hi,

We have a dashboard already:


https://ge.apache.org/scans/tests?search.names=Git%20branch&search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.timeZoneId=America%2FLos_Angeles&search.values=trunk&tests.sortField=FLAKY

On Tue, Nov 14, 2023 at 10:41 PM Николай Ижиков  wrote:

> Hello guys.
>
> I want to tell you about one more approach to deal with flaky tests.
> We adopt this approach in Apache Ignite community, so may be it can be
> helpful for Kafka, also.
>
> TL;DR: Apache Ignite community have a tool that provide a statistic of
> tests and can tell if PR introduces new failures.
>
> Apache Ignite has a many tests.
> Latest «Run All» contains around 75k.
> Most of test has integration style therefore count of flacky are
> significant.
>
> We build a tool - Team City Bot [1]
> That provides a combined statistic of flaky tests [2]
>
> This tool can compare results of Run All for PR and master.
> If all OK one can comment jira ticket with a visa from bot [3]
>
> Visa is a quality proof of PR for Ignite committers.
> And we can sort out most flaky tests and prioritize fixes with the bot
> statistic [2]
>
> TC bot integrated with the Team City only, for now.
> But, if Kafka community interested we can try to integrate it with Jenkins.
>
> [1] https://github.com/apache/ignite-teamcity-bot
> [2] https://tcbot2.sbt-ignite-dev.ru/current.html?branch=master&count=10
> [3]
> https://issues.apache.org/jira/browse/IGNITE-19950?focusedCommentId=17767394&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17767394
>
>
>
> > 15 нояб. 2023 г., в 09:18, Ismael Juma  написал(а):
> >
> > To use the pain analogy, people seem to have really good painkillers and
> > hence they somehow don't feel the pain already. ;)
> >
> > The reality is that important and high quality tests will get fixed. Poor
> > quality tests (low signal to noise ratio) might not get fixed and that's
> ok.
> >
> > I'm not opposed to marking the tests as release blockers as a starting
> > point, but I'm saying it's fine if people triage them and decide they are
> > not blockers. In fact, that has already happened in the past.
> >
> > Ismael
> >
> > On Tue, Nov 14, 2023 at 10:02 PM Matthias J. Sax 
> wrote:
> >
> >> I agree on the test gap argument. However, my worry is, if we don't
> >> "force the pain", it won't get fixed at all. -- I also know, that we try
> >> to find an working approach for many years...
> >>
> >> My take is that if we disable a test and file a non-blocking Jira, it's
> >> basically the same as just deleting the test all together and never talk
> >> about it again. -- I believe, this is not want we aim for, but we aim
> >> for good test coverage and a way to get these test fixed?
> >>
> >> Thus IMHO we need some forcing function (either keep the tests and feel
> >> the pain on every PR), or disable the test and file a blocker JIRA so
> >> the pain surfaces on a release forcing us to do something about it.
> >>
> >> If there is no forcing function, it basically means we are willing to
> >> accept test gaps forever.
> >>
> >>
> >> -Matthias
> >>
> >> On 11/14/23 9:09 PM, Ismael Juma wrote:
> >>> Matthias,
> >>>
> >>> Flaky tests are worse than useless. I know engineers find it hard to
> >>> disable them because of the supposed test gap (I find it hard too), but
> >> the
> >>> truth is that the test gap is already there! No-one blocks merging PRs
> on
> >>> flaky tests, but they do get used to ignoring build failures.
> >>>
> >>> The current approach has been attempted for nearly a decade and it has
> >>> never worked. I think we should try something different.
> >>>
> >>> When it comes to marking flaky tests as release blockers, I don't think
> >>> this should be done as a general rule. We should instead assess on a
> case
> >>> by case basis, same way we do it for bugs.
> >>>
> >>> Ismael
> >>>
> >>> On Tue, Nov 14, 2023 at 5:02 PM Matthias J. Sax 
> >> wrote:
> >>>
>  Thanks for starting this discussion David! I totally agree to "no"!
> 
>  I think there is no excuse whatsoever for merging PRs with compilation
>  errors (except an honest mistake for conflicting PRs that got merged
>  interleaved). -- Every committer must(!) check the Jenkins status
> before
>  merging to avoid such an issue.
> 
>  Similar for actual permanently broken tests. If there is no green
> build,
>  and the same test failed across multiple Jenkins runs, a committer
>  should detect this and cannot merge a PR.
> 
>  Given the current state of the CI pipeline, it seems possible to get
>  green runs, and thus I support the policy (that we actually always
> had)
>  to only merge if there is at least one green build. If committers got
>  sloppy about this, we need to call it out and put a hold on this
> >> practice.
> 
>  (The only exception from the above policy would be a very unstable
>  status for which getting a green build

Re: [DISCUSS] Road to Kafka 4.0

2023-11-21 Thread Luke Chen
Hi Colin and Jose,

I revisited the discussion of KIP-833 here
, and you
can see I'm the first one to reply to the discussion thread to express my
excitement at that time. Till now, I personally still think having KRaft in
Kafka is a good direction we have to move forward. But to move to this
destination, we need to make our users comfortable with this decision. The
worst scenario is, we said 4.0 is ready, and ZK is removed. Then, some
users move to 4.0 and say, wait a minute, why does it not support xxx
feature? And then start to search for other alternatives to replace Apache
Kafka. We all don't want to see this, right? So, that's why some community
users start to express their concern to move to 4.0 too quickly, including
me.


Quoting Colin:
> While dynamic quorum reconfiguration is a nice feature, it doesn't block
anything: not migration, not deployment.

Clearly Confluent team might deploy ZooKeeper in a particular way and
didn’t depend on its ability to support reconfiguration. So KRaft is ready
from your point of view. But users of Apache Kafka might have come to
depend on some ZooKeeper functionality, such as the ability to reconfigure
ZooKeeper quorums, that is not available in KRaft, yet. I don’t think the
Apache Kafka documentation has ever said “do not depend on this ability of
Apache Kafka or Zookeeper”, so it doesn’t seem unreasonable for users to
have deployed ZooKeeper in this way. In KIP-833
,
we said: “Modifying certain dynamic configurations on the standalone KRaft
controller” was an important missing feature. Unfortunately it wasn’t as
explicit as it could have been. While no one expects KRaft to support all
the features of ZooKeeper, it looks to me that users might depend on this
particular feature and it’s only recently that it’s become apparent that
you don’t consider it a blocker.

Quoting José:
> If we do a 3.8 release before 4.0 and we implement KIP-853 in 3.8, the
user will be able to migrate to a KRaft cluster that supports dynamically
changing the set of voters and has better support for disk failures.

Yes, KIP-853 and disk failure support are both very important missing
features. For the disk failure support, I don't think this is a
"good-to-have-feature", it should be a "must-have" IMO. We can't announce
the 4.0 release without a good solution for disk failure in KRaft.

It’s also worth thinking about how Apache Kafka users who depend on JBOD
might look at the risks of not having a 3.8 release. JBOD support on KRaft
is planned to be added in 3.7, and is still in progress so far. So it’s
hard to say it’s a blocker or not. But in practice, even if the feature is
made into 3.7 in time, a lot of new code for this feature is unlikely to be
entirely bug free. We need to maintain the confidence of those users, and
forcing them to migrate through 3.7 where this new code is hardly
battle-tested doesn’t appear to do that.

Our goal for 4.0 should be that all the “main” features in KRaft are in
production ready state. To reach the goal, I think having one more release
makes sense. We can have different opinions about what the “main features”
in KRaft are, but we should all agree, JBOD is one of them.

Alternatively, like Josep proposed, we can choose to have 4.0 + 3.7.x or
3.8 releases in parallel to maintain these 2 releases for a defined period.
But I think this is not a small effort to do that, especially as in v4.0,
much of ZK code will be removed, thus the diff between codebases will be
large. In other words the additional costs of the backporting required with
this alternative are likely to be higher than doing a 3.8 in my opinion.

Quoting José again:
> What are the disadvantages of adding the 3.8 release before 4.0? This
would push the 4.0 release by 3-4 months. From what we can tell, it would
also delay when KIP-896 can be implemented and extend how long the
community needs to maintain the code used by ZK mode. Is there anything
else?

If we agree with previous points, I think the disadvantages will just
disappear. The 3-4 months delay, the maintenance effort, KIP-896, and maybe
you can also raise scala 2.12 and java 8 removal, which are not that
critical compared with what I mentioned earlier that the worst case might
be that the users lose their confidence to Apache Kafka.


Quoting Colin:
> I would not want to delay that because we want an additional feature. And
we will always want additional features. So I am concerned we will end up
in an infinite loop of people asking for "just one more feature" before
they migrate.

I totally agree with you. We can keep delaying the 4.0 release forever. I'd
also like to draw a line to it. So, in my opinion, the 3.8 release is the
line. No 3.9, 3.10 releases after that. If this is the decision, will your
concern about this infinit

Re: [DISCUSS] Road to Kafka 4.0

2023-11-21 Thread Ismael Juma
Hi Luke,

I think we're conflating different things here. There are 3 separate points
in your email, but only 1 of them requires 3.8:

1. JBOD may have some bugs in 3.7.0. Whatever bugs exist can be fixed in
3.7.x. We have already said that we will backport critical fixes to 3.7.x
for some time.
2. Quorum reconfiguration is important to include in 4.0, the release where
ZK won't be supported. This doesn't need a 3.8 release either.
3. Quorum reconfiguration is necessary for migration use cases and hence
needs to be in a 3.x release. This one would require a 3.8 release if true.
But we should have a debate on whether it is indeed true. It's not clear to
me yet.

Ismael

On Tue, Nov 21, 2023 at 7:30 PM Luke Chen  wrote:

> Hi Colin and Jose,
>
> I revisited the discussion of KIP-833 here
> , and
> you
> can see I'm the first one to reply to the discussion thread to express my
> excitement at that time. Till now, I personally still think having KRaft in
> Kafka is a good direction we have to move forward. But to move to this
> destination, we need to make our users comfortable with this decision. The
> worst scenario is, we said 4.0 is ready, and ZK is removed. Then, some
> users move to 4.0 and say, wait a minute, why does it not support xxx
> feature? And then start to search for other alternatives to replace Apache
> Kafka. We all don't want to see this, right? So, that's why some community
> users start to express their concern to move to 4.0 too quickly, including
> me.
>
>
> Quoting Colin:
> > While dynamic quorum reconfiguration is a nice feature, it doesn't block
> anything: not migration, not deployment.
>
> Clearly Confluent team might deploy ZooKeeper in a particular way and
> didn’t depend on its ability to support reconfiguration. So KRaft is ready
> from your point of view. But users of Apache Kafka might have come to
> depend on some ZooKeeper functionality, such as the ability to reconfigure
> ZooKeeper quorums, that is not available in KRaft, yet. I don’t think the
> Apache Kafka documentation has ever said “do not depend on this ability of
> Apache Kafka or Zookeeper”, so it doesn’t seem unreasonable for users to
> have deployed ZooKeeper in this way. In KIP-833
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-833%3A+Mark+KRaft+as+Production+Ready#KIP833:MarkKRaftasProductionReady-MissingFeatures
> >,
> we said: “Modifying certain dynamic configurations on the standalone KRaft
> controller” was an important missing feature. Unfortunately it wasn’t as
> explicit as it could have been. While no one expects KRaft to support all
> the features of ZooKeeper, it looks to me that users might depend on this
> particular feature and it’s only recently that it’s become apparent that
> you don’t consider it a blocker.
>
> Quoting José:
> > If we do a 3.8 release before 4.0 and we implement KIP-853 in 3.8, the
> user will be able to migrate to a KRaft cluster that supports dynamically
> changing the set of voters and has better support for disk failures.
>
> Yes, KIP-853 and disk failure support are both very important missing
> features. I don't think disk failure support is a
> "good-to-have" feature; it should be a "must-have" IMO. We can't announce
> the 4.0 release without a good solution for disk failures in KRaft.
>
> It’s also worth thinking about how Apache Kafka users who depend on JBOD
> might look at the risks of not having a 3.8 release. JBOD support on KRaft
> is planned to be added in 3.7 and is still in progress, so it’s
> hard to say whether it’s a blocker or not. But in practice, even if the
> feature makes it into 3.7 in time, such a large amount of new code is
> unlikely to be entirely bug free. We need to maintain the confidence of
> those users, and forcing them to migrate through 3.7, where this new code
> is hardly battle-tested, doesn’t appear to do that.
>
> Our goal for 4.0 should be that all the “main” features in KRaft are in a
> production-ready state. To reach that goal, I think having one more release
> makes sense. We can have different opinions about what the “main features”
> in KRaft are, but we should all agree that JBOD is one of them.
>
> Alternatively, as Josep proposed, we could choose to maintain 4.0 and 3.7.x
> (or 3.8) releases in parallel for a defined period.
> But I think that is not a small effort, especially as much of the ZK code
> will be removed in 4.0, so the diff between the codebases will be
> large. In other words, the additional cost of the backporting required by
> this alternative is likely to be higher than doing a 3.8, in my opinion.
>
> Quoting José again:
> > What are the disadvantages of adding the 3.8 release before 4.0? This
> would push the 4.0 release by 3-4 months. From what we can tell, it would
> also delay when KIP-896 can be implemented and extend how long the
> community needs to maintain the code used by ZK mode. Is th

Re: How Kafka handle partition leader change?

2023-11-21 Thread De Gao
Looks like the core of the problem is still the juggling game of
consistency, availability and partition tolerance. If we want the cluster
to keep working when brokers have inconsistent information due to a network
partition, we have to choose between consistency and availability.
My proposal is not about fixing message loss. I will share it when ready.
Thanks Andrew.

From: Andrew Grant 
Sent: 21 November 2023 12:35
To: dev@kafka.apache.org 
Subject: Re: How Kafka handle partition leader change?

Hey De Gao,

Message loss or duplication can actually happen even without a leadership
change for a partition. For example, if there are network issues and the
producer never gets the ack from the server, it will retry and cause duplicates.
Message loss usually occurs with the acks=1 config: mostly you'd lose messages
after a leadership change, but in theory, if the leader was restarted, the page
cache was lost, and it became leader again, we could lose a message that
hadn't been replicated soon enough.
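
To make the acks trade-off concrete, here is a minimal sketch using the Java
producer client (the broker address, serializers and topic name are just
placeholder values, not anything specific to this discussion):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    // acks=1: the leader acks once the record is written locally, so a leader
    // change (or a restart that loses the page cache) before replication can drop it.
    props.put("acks", "1");
    // acks=all waits for the in-sync replicas and closes that window:
    // props.put("acks", "all");
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
        producer.send(new ProducerRecord<>("example-topic", "key", "value"));
    }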

You might be right it’s more likely to occur during leadership change though - 
not 100% sure myself on that.

Point being, the idempotent producer really is the way to write once and only 
once as far as I’m aware.
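
And enabling the idempotent producer is, as far as I know, just a couple of
configs on top of the same snippet above (again a sketch, not a full program):

    // Idempotence makes the broker deduplicate retried batches using the
    // producer id and per-partition sequence numbers attached to each batch.
    props.put("enable.idempotence", "true");
    // acks=all and retries are required (and effectively implied) with idempotence.
    props.put("acks", "all");
    props.put("retries", Integer.MAX_VALUE);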

If you have any suggestions for improvements, I'm sure the community would love
to hear them! It's possible there are ways to make leadership changes more
seamless and at least reduce the probability of duplicates or loss; I'm not sure
myself. For example, I've wondered before whether the old leader could reroute
messages for a short period of time until the client knows the new leader.

Andrew

Sent from my iPhone

> On Nov 21, 2023, at 1:42 AM, De Gao  wrote:
>
> I am asking this because I want to propose a change to Kafka. But it looks
> like in certain scenarios it is very hard not to lose or duplicate messages.
> I wonder in which scenarios we can accept that, and where to draw the line?
>
> 
> From: De Gao 
> Sent: 21 November 2023 6:25
> To: dev@kafka.apache.org 
> Subject: Re: How Kafka handle partition leader change?
>
> Thanks Andrew. Sounds like the leadership change on the Kafka side is a 'best
> effort' to avoid message duplication or loss. Can we say that message loss is
> very likely during a leadership change unless the producer uses idempotency?
> Is this a general situation where there is no intent to provide data integrity
> guarantees upon metadata change?
> 
> From: Andrew Grant 
> Sent: 20 November 2023 12:26
> To: dev@kafka.apache.org 
> Subject: Re: How Kafka handle partition leader change?
>
> Hey De Gao,
>
> The controller is the one that always elects a new leader. When that happens,
> the metadata is changed on the controller and, once committed, it's broadcast
> to all brokers in the cluster. In KRaft this would be via a PartitionChange
> record that each broker fetches from the controller. In ZK it'd be via an
> RPC from the controller to the broker.
>
> In either case each broker might get the notification at a different time; there
> is no ordering guarantee among the brokers. But eventually they'll all know the
> new leader, which means eventually a Produce request to the old leader will fail
> with a NotLeader error and the client will refresh its metadata and find out the
> new leader.
>
> In between all that leadership movement, there are various ways messages can 
> get duplicated or lost. However if you use the idempotent producer I believe 
> you actually won’t see dupes or missing messages so if that’s an important 
> requirement you could look into that. The producer is designed to retry in 
> general and when you use the idempotent producer some extra metadata is sent 
> around to dedupe any messages server-side that were sent multiple times by 
> the client.
>
> If you’re interested in learning more Kafka internals I highly recommend this 
> blog series 
> https://www.confluent.io/blog/apache-kafka-architecture-and-internals-by-jun-rao/
>
> Hope that helped a bit.
>
> Andy
>
> Sent from my iPhone
>
>> On Nov 20, 2023, at 2:07 AM, De Gao  wrote:
>>
>> Hi all, I have an interesting question here.
>>
>> Let's say we have 2 brokers B1 and B2, a controller C, and producers P1, P2...Pn.
>> Currently B1 holds the partition leadership and Px is constantly producing
>> messages to B1. We want to move the partition leadership to B2. How is the
>> leadership change synced between B1, B2, C, and Px such that it is guaranteed
>> that all the parties acknowledge the leadership change in the right order?
>> Is there a break in the produce flow in between? Any chance of message loss?
>>
>> Thanks
>>
>> De Gao