Jenkins build is still unstable: Kafka » Kafka Branch Builder » trunk #3323

2024-09-20 Thread Apache Jenkins Server
See 




[jira] [Created] (KAFKA-17584) Changing dynamic configurations resets retention

2024-09-20 Thread Christo Lolov (Jira)
Christo Lolov created KAFKA-17584:
-

 Summary: Changing dynamic configurations resets retention
 Key: KAFKA-17584
 URL: https://issues.apache.org/jira/browse/KAFKA-17584
 Project: Kafka
  Issue Type: Bug
Reporter: Christo Lolov
Assignee: Christo Lolov


Updating certain dynamic configurations (for example `message.max.bytes`) 
causes retention based on time to reset to the value for log.retention.ms. This 
poses a durability issue if users have set their retention by using 
log.retention.hours or log.retention.minutes. In other words, if a user has set 
log.retention.hours=-1 (infinite retention) and they dynamically change 
`message.max.bytes` their retention will immediately change back to the default 
of 60480 ms (7 days) and data before this will be scheduled for deletion 
immediately.

Steps to reproduce:
 # Add log.retention.minutes=1,log.retention.check.interval.ms=1000 to 
server.properties
 # Start a single ZK or KRaft instance + a single Kafka instance
 # Create a topic using 
```
bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic A 
--replication-factor 1 --partitions 1 --config min.insync.replicas=1 --config 
segment.bytes=512
```
 # Create a few segments with the console producer
 # Observe that they are deleted after 1 minute
 # Use the following command
```
bin/kafka-configs.sh --bootstrap-server loclahost:9092 --entity-type brokers 
--entity-default --alter --add-config message.max.bytes=1048609
```
(the value of `message.max.bytes` is irrelevant)
 # Create a few more segments with the console producer
 # Observe that segments are no longer deleted after 1 minute



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: New release branch 3.9

2024-09-20 Thread Christo Lolov
Hello,

I have filed https://issues.apache.org/jira/browse/KAFKA-17584 as a
blocker. While it has not been introduced by a KIP, I think this has
durability implications and needs to be addressed immediately.

Let me know if you disagree!

Best,
Christo

On Thu, 19 Sept 2024 at 16:01, José Armando García Sancio
 wrote:

> Hi Colin,
>
> I have a minor improvement to the rendering of our documentation that
> we should include in 3.9: https://github.com/apache/kafka/pull/17235
>
> Thanks
> -Jose
>


[jira] [Resolved] (KAFKA-17579) Dynamic LogCleaner configurations are not picked up upon restart

2024-09-20 Thread Christo Lolov (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-17579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christo Lolov resolved KAFKA-17579.
---
Resolution: Fixed

> Dynamic LogCleaner configurations are not picked up upon restart
> 
>
> Key: KAFKA-17579
> URL: https://issues.apache.org/jira/browse/KAFKA-17579
> Project: Kafka
>  Issue Type: Bug
>Reporter: Christo Lolov
>Priority: Major
>
> Dynamic configurations for the LogCleaner are not applied upon a restart.
>  
> Reproduction steps:
>  # Create a 1-broker cluster
>  # Change the number of log.cleaner.threads to 2
>  # Bounce the broker
>  # Observe the server.logs - you will notice that only 1 log cleaner thread 
> has been instantiated
>  
> Proposed solution:
>  # Change immutable variables `val` to methods `def`
>  # Add unit/integration tests which test reconfiguration of the LogCleaner



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-1082: Enable ID Generation for Clients over the ConsumerGroupHeartbeat RPC

2024-09-20 Thread Chia-Ping Tsai
> This part is not clear either. It basically says that if a member joins with 
> an existing member id but a different epoch, it will be fenced. Then it must 
> rejoin with the same member id and epoch zero. This is already the current 
> behavior and it does not help with detecting duplicates, right?

The duplicates issue is interesting. The member id is generated on server-side 
by UUID before, so it seems to me the fenced member epoch happens only if the 
client miss the response with bumped epoch, and this case is recoverable.

however, v1 brings a different fenced epoch scenario. Users, now, have 
responsibility to generate unique member id, hence clients may encounter 
infinite fenced error if there are >1 clients having same member id due to 
config error (or other engineer error)

Maybe we can highlight this scenario if we start to requires clients to 
generate unique member id.

Best,
Chia-Ping

On 2024/09/19 18:36:48 David Jacot wrote:
> Hi,
> 
> Thanks for the update. I have a few nits:
> 
> > If the member ID is null or empty, the server will reject the request
> with an InvalidRequestException.
> We should clarify that this should only apply to version >= 1.
> 
> > The consumer instance must generate a member ID, and this ID should
> remain consistent for the duration of the consumer's session. Here, a
> "session" is defined as the period from the consumer's first heartbeat
> until it leaves the group, either through a graceful shutdown, a heartbeat
> timeout, or the process stopping or dying. The consumer instance should
> reuse the same member ID for all heartbeats and rejoin attempts to maintain
> continuity within the group.
> 
> This part is not clear to me. When the member leaves the group, it should
> not reset the member id. I would rather say that the member must generate
> its member id when it starts and it must keep it until the process stops.
> It is basically an incarnation of the process.
> 
> > If a conflict arises where the member ID generated by the client is
> detected to be a duplicate within the same group (for example, the same
> member ID is associated with another active member in the group), the
> server will handle this by comparing the memberEpoch values of the
> conflicting members. The member with the lower memberEpoch is considered
> outdated and will be fenced off by the server. When this occurs, the server
> responds with a FENCED_MEMBER_EPOCH error to the client, signaling it to
> rejoin the group with the same member ID while resetting the memberEpoch to
> zero. This ensures that the client properly resynchronizes and maintains
> the continuity and consistency of the group membership.
> 
> This part is not clear either. It basically says that if a member joins
> with an existing member id but a different epoch, it will be fenced. Then
> it must rejoin with the same member id and epoch zero. This is already the
> current behavior and it does not help with detecting duplicates, right?
> Should we just remove the paragraph?
> 
> > A member ID mismatch occurs within a session: If the server detects a
> mismatch between the provided member ID and the expected member ID for an
> ongoing session, it should return a UNKNOWN_MEMBER_ID  error.
> 
> How could we detect a mismatch between the provided and the expected member
> id? My understanding is that we can only know whether the provided member
> id exists or not. This is already implemented.
> 
> Thanks,
> David
> 
> On Sat, Sep 14, 2024 at 9:31 AM TengYao Chi  wrote:
> 
> > Hello everyone,
> >
> > Since this KIP has been fully discussed, I will initiate a vote for it next
> > Monday.
> > Thank you and have a nice weekend.
> >
> > Best regards,
> > TengYao
> >
> > TengYao Chi  於 2024年9月5日 週四 下午2:19寫道:
> >
> > > Hello everyone,
> > >
> > > KT2: It looks like everyone who has expressed an opinion supports the
> > > second option: “Document a recommendation for clients to use UUIDs as
> > > member IDs, without strictly enforcing it.”
> > > I have updated the KIP accordingly.
> > > Please take a look, and let me know if you have any thoughts or feedback.
> > >
> > > Thank you!
> > >
> > > Best regards,
> > > TengYao
> > >
> > > Chia-Ping Tsai  於 2024年8月30日 週五 下午9:56寫道:
> > >
> > >> hi TengYao
> > >>
> > >> KT2: +1 to second approach
> > >>
> > >> Best,
> > >> Chia-Ping
> > >>
> > >>
> > >> David Jacot  於 2024年8月30日 週五 下午9:15寫道:
> > >>
> > >> > Hi TengYao,
> > >> >
> > >> > KT2: I don't think that we can realistically validate the uuid on the
> > >> > server. It is basically a string of chars. So I lean towards having a
> > >> good
> > >> > recommendation in the KIP and in the document of the field in the
> > RPC's
> > >> > definition.
> > >> >
> > >> > Best,
> > >> > David
> > >> >
> > >> > On Fri, Aug 30, 2024 at 3:02 PM TengYao Chi 
> > >> wrote:
> > >> >
> > >> > > Hello Kirk !
> > >> > >
> > >> > > Thank you for your comments !
> > >> > >
> > >> > > KT1: Yes, you are correct. The issue is not unique to the initial

[jira] [Created] (KAFKA-17585) `offsetResetStrategyTimestamp` should return `long` instead of `Long`

2024-09-20 Thread Chia-Ping Tsai (Jira)
Chia-Ping Tsai created KAFKA-17585:
--

 Summary: `offsetResetStrategyTimestamp` should return `long` 
instead of `Long`
 Key: KAFKA-17585
 URL: https://issues.apache.org/jira/browse/KAFKA-17585
 Project: Kafka
  Issue Type: Improvement
Reporter: Chia-Ping Tsai
Assignee: Chia-Ping Tsai


the null behavior was removed by 
[https://github.com/apache/kafka/commit/2b233bfa5f35bf237effe8c5202fd2b80c601943,]
 so we can return long and then remove the null check to simplify the code.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Jenkins build is still unstable: Kafka » Kafka Branch Builder » trunk #3321

2024-09-20 Thread Apache Jenkins Server
See 




Re: [VOTE] KIP-1074: Allow the replication of user internal topics

2024-09-20 Thread Patrik Marton
Hi All,

Thank you so much for the votes so far! We're just one vote away from
reaching our goal for the KIP. I’d really appreciate it if you could take a
moment to vote or share your thoughts!

Thanks,
Patrik

On Mon, Sep 9, 2024 at 4:20 PM Chris Egerton 
wrote:

> Hi Patrik,
>
> Sorry for the delay, I've been on PTO. LGTM! +1 (binding)
>
> Cheers,
>
> Chris
>
> On Tue, Sep 3, 2024 at 5:10 AM Viktor Somogyi-Vass
>  wrote:
>
> > Hi Patrik,
> >
> > Thanks for working on this. +1 from me (binding).
> >
> > Best,
> > Viktor
> >
> > On Wed, Aug 28, 2024 at 1:34 PM Patrik Marton
>  > >
> > wrote:
> >
> > > Hi all,
> > >
> > > As the proposal is finalized, and there was no new feedback in the past
> > > week, I would like to open voting for KIP-1074
> > > <
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1074%3A+Allow+the+replication+of+user+internal+topics
> > > >
> > > .
> > >
> > > Please vote / let me know your thoughts.
> > >
> > > Thanks,
> > > Patrik
> > >
> >
>


Re: [DISCUSS] Apache Kafka 3.8.1 release

2024-09-20 Thread Chia-Ping Tsai
+1

Luke Chen  於 2024年9月20日 週五 下午4:22寫道:

> Thanks Josep!
> +1
>
> Luke
>
> On Fri, Sep 20, 2024 at 3:36 PM Josep Prat 
> wrote:
>
> > Hey folks,
> >
> > I'd like to volunteer to be the release manager for a bug fix release of
> > the 3.8 series. This will be the first bug fix release and will be
> version
> > 3.8.1.
> >
> > If no one has any objections, I will send out a release plan on
> 2024/09/23
> > that includes a list of all of the fixes we are targeting for 3.8.1 along
> > with a timeline (aiming probably for a release happening at the beginning
> > of October).
> >
> > Thanks!
> >
> > --
> > [image: Aiven] 
> >
> > *Josep Prat*
> > Open Source Engineering Director, *Aiven*
> > josep.p...@aiven.io   |   +491715557497
> > aiven.io    |   <
> https://www.facebook.com/aivencloud
> > >
> >      <
> > https://twitter.com/aiven_io>
> > *Aiven Deutschland GmbH*
> > Alexanderufer 3-7, 10117 Berlin
> > Geschäftsführer: Oskari Saarenmaa, Hannu Valtonen,
> > Anna Richardson, Kenneth Chen
> > Amtsgericht Charlottenburg, HRB 209739 B
> >
>


Re: [DISCUSS] KIP-1073 Return inactive observer nodes in DescribeQuorum response

2024-09-20 Thread Gantigmaa Selenge
Hi everyone,

Thanks for reviewing the KIP again! I have updated it to address the latest
comments from Colin and Jose.

Going back to Jose's earlier comment here:
> Did you consider making "includeFencedBroker" field value implicitly
based on the DescribeClusterRequest version and the EndpointType? For
example, if the request version is 2 and the endpoint type is broker,
the response will include all of the brokers and if they are fenced or
unfenced. I am suggesting this because a request of endpointType
controller and includeFencedBroker true doesn't make sense. In
general, users of the AdminClient cannot easily discover the API
versions supported by the destination node since the destination node
may not be known when the admin call is made.

I think admins would request fenced brokers only in certain scenarios, so
perhaps it makes sense to include them only when requested via the new
option. If the endpoint type is controller and the option to include fenced
brokers set to true, the Admin Client can throw an
"IllegalArgumentException" without sending the request to the controller.

If you are happy with the updates I made, I would like to start the vote on
this KIP. Thank you.

Regards,
Gantigmaa


On Sat, Sep 14, 2024 at 9:51 PM Colin McCabe  wrote:

> Hi Gantigmaa Selenge,
>
> I think we should have a boolean in the request like "includeFenced" which
> defaults to false. Then if we can't include the fenced brokers, we throw an
> UnsupportedVersionException and fall back to the old behavior.
>
> There are two main reasons for this:
> 1. if we don't do this, we're changing the behavior for old clients in an
> incompatible way. They used to get only unfenced brokers, now they get both
> and they have no way of telling which is which.
>
> 2. new clients still would like to know whether they're getting
> everything, or just unfenced brokers, for informational reasons.
>
> Probably the command-line tool could print out something like "fenced
> brokers will not be shown" in the case where it's talking to an old server.
> Or the tool could have a flag like --show-fenced which fails if it can't be
> done?
>
> best,
> Colin
>
>
> On Fri, Sep 13, 2024, at 08:28, José Armando García Sancio wrote:
> > Thanks Gantigmaa. See comments below. The KIP LGTM after this.
> >
> > On Thu, Sep 12, 2024 at 4:50 AM Gantigmaa Selenge 
> wrote:
> >> Will this make it more confusing for combined nodes? Users might expect
> to
> >> see both controller and broker under ROLES if it is a combined node.
> Since
> >> we would be able to output only one depending on the bootstrap option, I
> >> wonder if it's better to stay consistent with the information in the
> result
> >> returned by the API.
> >
> > How about ENDPOINT-TYPE for the column name? That matches the keyword
> > used in the request and better matches the semantic of the output.
> >
> > Also, I am okay if you want to leave it out. I thought the user may
> > find it useful but if you find it confusing we can leave it out.
> >
> > Thanks,
> > --
> > -José
>
>


[DISCUSS] Apache Kafka 3.8.1 release

2024-09-20 Thread Josep Prat
Hey folks,

I'd like to volunteer to be the release manager for a bug fix release of
the 3.8 series. This will be the first bug fix release and will be version
3.8.1.

If no one has any objections, I will send out a release plan on 2024/09/23
that includes a list of all of the fixes we are targeting for 3.8.1 along
with a timeline (aiming probably for a release happening at the beginning
of October).

Thanks!

-- 
[image: Aiven] 

*Josep Prat*
Open Source Engineering Director, *Aiven*
josep.p...@aiven.io   |   +491715557497
aiven.io    |   
     
*Aiven Deutschland GmbH*
Alexanderufer 3-7, 10117 Berlin
Geschäftsführer: Oskari Saarenmaa, Hannu Valtonen,
Anna Richardson, Kenneth Chen
Amtsgericht Charlottenburg, HRB 209739 B


Re: [DISCUSS] Apache Kafka 3.8.1 release

2024-09-20 Thread Luke Chen
Thanks Josep!
+1

Luke

On Fri, Sep 20, 2024 at 3:36 PM Josep Prat 
wrote:

> Hey folks,
>
> I'd like to volunteer to be the release manager for a bug fix release of
> the 3.8 series. This will be the first bug fix release and will be version
> 3.8.1.
>
> If no one has any objections, I will send out a release plan on 2024/09/23
> that includes a list of all of the fixes we are targeting for 3.8.1 along
> with a timeline (aiming probably for a release happening at the beginning
> of October).
>
> Thanks!
>
> --
> [image: Aiven] 
>
> *Josep Prat*
> Open Source Engineering Director, *Aiven*
> josep.p...@aiven.io   |   +491715557497
> aiven.io    |    >
>      <
> https://twitter.com/aiven_io>
> *Aiven Deutschland GmbH*
> Alexanderufer 3-7, 10117 Berlin
> Geschäftsführer: Oskari Saarenmaa, Hannu Valtonen,
> Anna Richardson, Kenneth Chen
> Amtsgericht Charlottenburg, HRB 209739 B
>


Re: [DISCUSS] KIP-1082: Enable ID Generation for Clients over the ConsumerGroupHeartbeat RPC

2024-09-20 Thread TengYao Chi
Hi Chia-Ping,

Thanks for pointing out this issue.

I’m thinking that maybe we might need to define a new Exception to handle
this scenario.
Once the client receives this exception, it should consider the exception
as a serious error and stop the process, and the user will need to assign a
new memberId.
What do you think?

Best Regards,
TengYao

Chia-Ping Tsai  於 2024年9月20日 週五 下午7:00寫道:

> > This part is not clear either. It basically says that if a member joins
> with an existing member id but a different epoch, it will be fenced. Then
> it must rejoin with the same member id and epoch zero. This is already the
> current behavior and it does not help with detecting duplicates, right?
>
> The duplicates issue is interesting. The member id is generated on
> server-side by UUID before, so it seems to me the fenced member epoch
> happens only if the client miss the response with bumped epoch, and this
> case is recoverable.
>
> however, v1 brings a different fenced epoch scenario. Users, now, have
> responsibility to generate unique member id, hence clients may encounter
> infinite fenced error if there are >1 clients having same member id due to
> config error (or other engineer error)
>
> Maybe we can highlight this scenario if we start to requires clients to
> generate unique member id.
>
> Best,
> Chia-Ping
>
> On 2024/09/19 18:36:48 David Jacot wrote:
> > Hi,
> >
> > Thanks for the update. I have a few nits:
> >
> > > If the member ID is null or empty, the server will reject the request
> > with an InvalidRequestException.
> > We should clarify that this should only apply to version >= 1.
> >
> > > The consumer instance must generate a member ID, and this ID should
> > remain consistent for the duration of the consumer's session. Here, a
> > "session" is defined as the period from the consumer's first heartbeat
> > until it leaves the group, either through a graceful shutdown, a
> heartbeat
> > timeout, or the process stopping or dying. The consumer instance should
> > reuse the same member ID for all heartbeats and rejoin attempts to
> maintain
> > continuity within the group.
> >
> > This part is not clear to me. When the member leaves the group, it should
> > not reset the member id. I would rather say that the member must generate
> > its member id when it starts and it must keep it until the process stops.
> > It is basically an incarnation of the process.
> >
> > > If a conflict arises where the member ID generated by the client is
> > detected to be a duplicate within the same group (for example, the same
> > member ID is associated with another active member in the group), the
> > server will handle this by comparing the memberEpoch values of the
> > conflicting members. The member with the lower memberEpoch is considered
> > outdated and will be fenced off by the server. When this occurs, the
> server
> > responds with a FENCED_MEMBER_EPOCH error to the client, signaling it to
> > rejoin the group with the same member ID while resetting the memberEpoch
> to
> > zero. This ensures that the client properly resynchronizes and maintains
> > the continuity and consistency of the group membership.
> >
> > This part is not clear either. It basically says that if a member joins
> > with an existing member id but a different epoch, it will be fenced. Then
> > it must rejoin with the same member id and epoch zero. This is already
> the
> > current behavior and it does not help with detecting duplicates, right?
> > Should we just remove the paragraph?
> >
> > > A member ID mismatch occurs within a session: If the server detects a
> > mismatch between the provided member ID and the expected member ID for an
> > ongoing session, it should return a UNKNOWN_MEMBER_ID  error.
> >
> > How could we detect a mismatch between the provided and the expected
> member
> > id? My understanding is that we can only know whether the provided member
> > id exists or not. This is already implemented.
> >
> > Thanks,
> > David
> >
> > On Sat, Sep 14, 2024 at 9:31 AM TengYao Chi  wrote:
> >
> > > Hello everyone,
> > >
> > > Since this KIP has been fully discussed, I will initiate a vote for it
> next
> > > Monday.
> > > Thank you and have a nice weekend.
> > >
> > > Best regards,
> > > TengYao
> > >
> > > TengYao Chi  於 2024年9月5日 週四 下午2:19寫道:
> > >
> > > > Hello everyone,
> > > >
> > > > KT2: It looks like everyone who has expressed an opinion supports the
> > > > second option: “Document a recommendation for clients to use UUIDs as
> > > > member IDs, without strictly enforcing it.”
> > > > I have updated the KIP accordingly.
> > > > Please take a look, and let me know if you have any thoughts or
> feedback.
> > > >
> > > > Thank you!
> > > >
> > > > Best regards,
> > > > TengYao
> > > >
> > > > Chia-Ping Tsai  於 2024年8月30日 週五 下午9:56寫道:
> > > >
> > > >> hi TengYao
> > > >>
> > > >> KT2: +1 to second approach
> > > >>
> > > >> Best,
> > > >> Chia-Ping
> > > >>
> > > >>
> > > >> David Jacot  於 2024年8月30日 週五 下午9:15寫道:
> > > >>
>

Re: [DISCUSS] KIP-1090 Flaky Tests 👻

2024-09-20 Thread David Arthur
Chia-Ping, I think something like that can work. I was also thinking about
extracting the test names during trunk builds using Gradle and storing that
somewhere. I think it's fair to say we can derive this data from Git and
Develocity. We can probably figure out the implementation details later on
:)

I've updated the KIP to describe the two-tiered approach as well as
including a diagram.

-David


On Thu, Sep 19, 2024 at 12:40 PM Chia-Ping Tsai  wrote:

> > However, this doesn't help with the newly added tests that introduce
> flakiness. For this, we need some way
> to detect when a new test has been added. Still thinking through this...
>
> Maybe we can use git + gradle develocity to address it.
>
> 1) list the files changed recently (for example: git diff --name-only "@{7
> days ago}")
> 2) use the test-related filename to query Gradle Develocity with "7 days
> ago"
> 3) the test cases having fewer builds are viewed as "new tests"
>
> WDYT?
>
> David Arthur  於 2024年9月19日 週四 下午11:56寫道:
>
> > TengYao,
> >
> > > These two mechanisms are independent. We could manually remove a tag
> from
> > a
> > test, but at the same time, it might still be quarantined.
> > I know the above situation might sound weird, but I just want to
> understand
> > how it would work.
> >
> > If we remove a tag from a test, we are signaling that we *think* it is
> now
> > stable. However, until we have data
> > to prove that, we can run the test in the quarantine. This helps with the
> > situation where there are multiple
> > sources of flakiness and we are making iterative progress on improving
> it.
> >
> > The alternative is to keep the fix in a PR until the developer gets
> enough
> > signal from the PR builds to be
> > convinced that the flakiness is gone. I mention in the KIP that this
> > approach has issues.
> >
> >
> > Chia-Ping,
> >
> > > If so, could we query the Gradle develocity to get flaky and then make
> > only those flaky
> > retryable in CI?
> >
> > This can work for existing flaky tests. We can query Develocity to find
> > flaky tests in the last N days
> > and run only those tests with retry. This is what I meant by "data
> driven"
> > for the quarantine. So, no need to track
> > "quarantined.txt" file separately -- we can derive this data at build
> time
> > from Develocity.
> >
> > However, this doesn't help with the newly added tests that introduce
> > flakiness. For this, we need some way
> > to detect when a new test has been added. Still thinking through this...
> >
> > -David
> >
> > On Thu, Sep 19, 2024 at 11:33 AM Chia-Ping Tsai 
> > wrote:
> >
> > > hi David
> > >
> > > The two-tiered approach is interesting and I have questions similar to
> > > TengYao.
> > >
> > > BUT, go back to the usage of quarantine and isolation. It seems to me
> > they
> > > are used to make our CI not be noised by the flaky, right? If so, could
> > we
> > > query the Gradle develocity to get flaky and then make only those flaky
> > > retryable in CI? That means a non-flaky (stable) test must pass without
> > > retry.
> > >
> > > This approach is more simple (I guess), and it can isolate the flaky
> (by
> > > retry) from CI. If a flaky can't pass after retry, I feel it is a
> > critical
> > > failed test rather than just a flaky. On top of that, this approach
> does
> > > not need an extra file ("quarantined.txt") in git control and it does
> not
> > > need to change any code of test classes.
> > >
> > > Best,
> > > Chia-Ping
> > >
> > > TengYao Chi  於 2024年9月19日 週四 下午11:19寫道:
> > >
> > > > Hi David,
> > > >
> > > > Thanks for the explanation.
> > > >
> > > > I like this two-tiered approach, which gives us more flexibility to
> > > handle
> > > > flaky tests.
> > > >
> > > > The following is my understanding of how it works; please correct me
> if
> > > I'm
> > > > wrong:
> > > > If we adopt the two-tiered approach, the test might have two
> > > > states.(Isolated by developer, Quarantined automatically).
> > > > These two mechanisms are independent. We could manually remove a tag
> > > from a
> > > > test, but at the same time, it might still be quarantined.
> > > > I know the above situation might sound weird, but I just want to
> > > understand
> > > > how it would work.
> > > >
> > > > Best Regards,
> > > > TengYao
> > > >
> > > > David Arthur  於 2024年9月19日 週四 下午10:07寫道:
> > > >
> > > > > Chia/TengYao/TaiJuWu, I agree that tags are a straightforward
> > approach.
> > > > In
> > > > > fact, my initial idea was to use tags as the isolation mechanism.
> > > > >
> > > > > Let me try to motivate the use of a text file a bit more.
> > > > >
> > > > > Consider the "new tests" scenario where a developer has added a new
> > > > > integration test. If we use annotations, this means someone (the
> > > original
> > > > > developer, or another committer) will need to raise a PR after a
> few
> > > days
> > > > > to remove the annotation (assuming the test was stable).
> Eventually,
> > I
> > > > was
> > > > > hoping to automate or partial

Re: [DISCUSS] KIP-1082: Enable ID Generation for Clients over the ConsumerGroupHeartbeat RPC

2024-09-20 Thread Chia-Ping Tsai
wait a minute. There is a description about fenced error in the KIP-848

> Checks whether the member epoch matches the member epoch in its current
assignment. FENCED_MEMBER_EPOCH is returned otherwise. The member is also
removed from the group.

I failed to find the code of implementing "The member is also removed from
the group.". I will appreciate it if someone can share the code link to me.

Let's assume it will be implemented eventually, and I feel that will cause
a serious issue as the consumers having the same member id will remove the
existing  member and then create itself member?

> Once the client receives this exception, it should consider the exception
as a serious error and stop the process, and the user will need to assign a
new memberId.

Given that member id is the incarnation of consumer, it is hard to detect
the scenario of "two consumers have the same member id" (it may be caused
by the lost response). I guess the best approach is the later -created
consumer should be idle with fenced error, and we should add this to the
KIP (maybe we should revise the error message carried by
`FencedMemberEpochException`). BUT, we need to rethink the approach if
above description "The member is also removed from the group" is true

Best,
Chia-Ping



TengYao Chi  於 2024年9月20日 週五 下午8:17寫道:

> Hi Chia-Ping,
>
> Thanks for pointing out this issue.
>
> I’m thinking that maybe we might need to define a new Exception to handle
> this scenario.
> Once the client receives this exception, it should consider the exception
> as a serious error and stop the process, and the user will need to assign a
> new memberId.
> What do you think?
>
> Best Regards,
> TengYao
>
> Chia-Ping Tsai  於 2024年9月20日 週五 下午7:00寫道:
>
> > > This part is not clear either. It basically says that if a member joins
> > with an existing member id but a different epoch, it will be fenced. Then
> > it must rejoin with the same member id and epoch zero. This is already
> the
> > current behavior and it does not help with detecting duplicates, right?
> >
> > The duplicates issue is interesting. The member id is generated on
> > server-side by UUID before, so it seems to me the fenced member epoch
> > happens only if the client miss the response with bumped epoch, and this
> > case is recoverable.
> >
> > however, v1 brings a different fenced epoch scenario. Users, now, have
> > responsibility to generate unique member id, hence clients may encounter
> > infinite fenced error if there are >1 clients having same member id due
> to
> > config error (or other engineer error)
> >
> > Maybe we can highlight this scenario if we start to requires clients to
> > generate unique member id.
> >
> > Best,
> > Chia-Ping
> >
> > On 2024/09/19 18:36:48 David Jacot wrote:
> > > Hi,
> > >
> > > Thanks for the update. I have a few nits:
> > >
> > > > If the member ID is null or empty, the server will reject the request
> > > with an InvalidRequestException.
> > > We should clarify that this should only apply to version >= 1.
> > >
> > > > The consumer instance must generate a member ID, and this ID should
> > > remain consistent for the duration of the consumer's session. Here, a
> > > "session" is defined as the period from the consumer's first heartbeat
> > > until it leaves the group, either through a graceful shutdown, a
> > heartbeat
> > > timeout, or the process stopping or dying. The consumer instance should
> > > reuse the same member ID for all heartbeats and rejoin attempts to
> > maintain
> > > continuity within the group.
> > >
> > > This part is not clear to me. When the member leaves the group, it
> should
> > > not reset the member id. I would rather say that the member must
> generate
> > > its member id when it starts and it must keep it until the process
> stops.
> > > It is basically an incarnation of the process.
> > >
> > > > If a conflict arises where the member ID generated by the client is
> > > detected to be a duplicate within the same group (for example, the same
> > > member ID is associated with another active member in the group), the
> > > server will handle this by comparing the memberEpoch values of the
> > > conflicting members. The member with the lower memberEpoch is
> considered
> > > outdated and will be fenced off by the server. When this occurs, the
> > server
> > > responds with a FENCED_MEMBER_EPOCH error to the client, signaling it
> to
> > > rejoin the group with the same member ID while resetting the
> memberEpoch
> > to
> > > zero. This ensures that the client properly resynchronizes and
> maintains
> > > the continuity and consistency of the group membership.
> > >
> > > This part is not clear either. It basically says that if a member joins
> > > with an existing member id but a different epoch, it will be fenced.
> Then
> > > it must rejoin with the same member id and epoch zero. This is already
> > the
> > > current behavior and it does not help with detecting duplicates, right?
> > > Should we just remove the parag

[jira] [Resolved] (KAFKA-16813) Add global timeout for "@Test" and "@TestTemplate"

2024-09-20 Thread Chia-Ping Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chia-Ping Tsai resolved KAFKA-16813.

Fix Version/s: 4.0.0
   Resolution: Fixed

> Add global timeout for "@Test" and "@TestTemplate"
> --
>
> Key: KAFKA-16813
> URL: https://issues.apache.org/jira/browse/KAFKA-16813
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Chia-Ping Tsai
>Assignee: 黃竣陽
>Priority: Minor
> Fix For: 4.0.0
>
>
> in code base `@Test` is used by unit test and `@TestTemplate` is used by 
> integration test. The later includes `ParameterizedTest`, `ClusterTest`, 
> `ClusterTests`, and `ClusterTemplate`. Hence, we can add two different 
> timeout for `@Test` and `@TestTemplate`. For example:
> junit.jupiter.execution.timeout.default = 30s
> junit.jupiter.execution.timeout.testtemplate.method.default = 120s
> The accurate timeout value may need more discussion, but we can try it in 
> small junit5 module first. For example: tools module and storage module.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-17587) Move test infrastructure out of core

2024-09-20 Thread David Arthur (Jira)
David Arthur created KAFKA-17587:


 Summary: Move test infrastructure out of core
 Key: KAFKA-17587
 URL: https://issues.apache.org/jira/browse/KAFKA-17587
 Project: Kafka
  Issue Type: Improvement
  Components: core
Reporter: David Arthur


Currently, our integration test infrastructure exists in the ":core" module's 
test sources. This means that any integration test must exist in 
"core/src/test/java" or "core/src/test/scala". 

This has two negative consequences. First, it means most of our integration 
tests live in "core" which is why that module's test time is by far the 
highest. The other related problem is that modules cannot easily define 
integration tests in their directory due to circularity with the core test 
dependency. 

For example, ":metadata" could not add ClusterTests because that would require 
a dependency on "project(':core').sourceSets.test.output" – but this can't 
happen because ":core" depends on ":metadata".

We should refactor our test infrastructure classes so that we can untangle 
these dependencies.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-17567) Remove TestTruncate

2024-09-20 Thread Chia-Ping Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-17567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chia-Ping Tsai resolved KAFKA-17567.

Fix Version/s: 4.0.0
   Resolution: Fixed

> Remove TestTruncate
> ---
>
> Key: KAFKA-17567
> URL: https://issues.apache.org/jira/browse/KAFKA-17567
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Chia-Ping Tsai
>Assignee: bboyleonp
>Priority: Minor
> Fix For: 4.0.0
>
>
> as title



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-17589) Move JUnit extensions to test-common module

2024-09-20 Thread David Arthur (Jira)
David Arthur created KAFKA-17589:


 Summary: Move JUnit extensions to test-common module
 Key: KAFKA-17589
 URL: https://issues.apache.org/jira/browse/KAFKA-17589
 Project: Kafka
  Issue Type: Sub-task
Reporter: David Arthur
Assignee: David Arthur


Create a new Gradle module "test-common" where our JUnit extensions and 
annotations can live. This module should have minimal dependencies.

We can also use this module as a place to keep a single copy of 
junit-platform.properties



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-17588) Remove internal classes from ClusterInstance

2024-09-20 Thread David Arthur (Jira)
David Arthur created KAFKA-17588:


 Summary: Remove internal classes from ClusterInstance
 Key: KAFKA-17588
 URL: https://issues.apache.org/jira/browse/KAFKA-17588
 Project: Kafka
  Issue Type: Sub-task
Reporter: David Arthur
Assignee: David Arthur


ClusterInstance is the interface used by our integration tests for interacting 
with the cluster at runtime. Originally, this only included things like the 
topology of the cluster, configs, and some helpers for AdminClient. Over time, 
we have exposed more and more internal classes through this interface which has 
created a tight coupling with the "core" module.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Jenkins build is still unstable: Kafka » Kafka Branch Builder » trunk #3322

2024-09-20 Thread Apache Jenkins Server
See 




Re: [DISCUSS] KIP-1073 Return inactive observer nodes in DescribeQuorum response

2024-09-20 Thread José Armando García Sancio
Thanks for the updates. The latest KIP LGTM.

On Fri, Sep 20, 2024 at 8:38 AM Gantigmaa Selenge  wrote:
>
> Hi everyone,
>
> Thanks for reviewing the KIP again! I have updated it to address the latest
> comments from Colin and Jose.
>
> Going back to Jose's earlier comment here:
> > Did you consider making "includeFencedBroker" field value implicitly
> based on the DescribeClusterRequest version and the EndpointType? For
> example, if the request version is 2 and the endpoint type is broker,
> the response will include all of the brokers and if they are fenced or
> unfenced. I am suggesting this because a request of endpointType
> controller and includeFencedBroker true doesn't make sense. In
> general, users of the AdminClient cannot easily discover the API
> versions supported by the destination node since the destination node
> may not be known when the admin call is made.
>
> I think admins would request fenced brokers only in certain scenarios, so
> perhaps it makes sense to include them only when requested via the new
> option. If the endpoint type is controller and the option to include fenced
> brokers set to true, the Admin Client can throw an
> "IllegalArgumentException" without sending the request to the controller.
>
> If you are happy with the updates I made, I would like to start the vote on
> this KIP. Thank you.
>
> Regards,
> Gantigmaa
>
>
> On Sat, Sep 14, 2024 at 9:51 PM Colin McCabe  wrote:
>
> > Hi Gantigmaa Selenge,
> >
> > I think we should have a boolean in the request like "includeFenced" which
> > defaults to false. Then if we can't include the fenced brokers, we throw an
> > UnsupportedVersionException and fall back to the old behavior.
> >
> > There are two main reasons for this:
> > 1. if we don't do this, we're changing the behavior for old clients in an
> > incompatible way. They used to get only unfenced brokers, now they get both
> > and they have no way of telling which is which.
> >
> > 2. new clients still would like to know whether they're getting
> > everything, or just unfenced brokers, for informational reasons.
> >
> > Probably the command-line tool could print out something like "fenced
> > brokers will not be shown" in the case where it's talking to an old server.
> > Or the tool could have a flag like --show-fenced which fails if it can't be
> > done?
> >
> > best,
> > Colin
> >
> >
> > On Fri, Sep 13, 2024, at 08:28, José Armando García Sancio wrote:
> > > Thanks Gantigmaa. See comments below. The KIP LGTM after this.
> > >
> > > On Thu, Sep 12, 2024 at 4:50 AM Gantigmaa Selenge 
> > wrote:
> > >> Will this make it more confusing for combined nodes? Users might expect
> > to
> > >> see both controller and broker under ROLES if it is a combined node.
> > Since
> > >> we would be able to output only one depending on the bootstrap option, I
> > >> wonder if it's better to stay consistent with the information in the
> > result
> > >> returned by the API.
> > >
> > > How about ENDPOINT-TYPE for the column name? That matches the keyword
> > > used in the request and better matches the semantic of the output.
> > >
> > > Also, I am okay if you want to leave it out. I thought the user may
> > > find it useful but if you find it confusing we can leave it out.
> > >
> > > Thanks,
> > > --
> > > -José
> >
> >



-- 
-José


Re: [DISCUSS] KIP-1090 Flaky Tests 👻

2024-09-20 Thread José Armando García Sancio
Thanks for the proposal David.

Can modules opt out of this feature? For example, the raft module
doesn't have any integration tests and all of the tests are meant to
be deterministic. It would be dangerous to the protocol's correctness
and the consistency of the cluster metadata to allow contributors to
mark tests as flaky in the raft module instead of investigating the
source of the failure.

On Fri, Sep 20, 2024 at 12:07 PM David Arthur  wrote:
>
> Chia-Ping, I think something like that can work. I was also thinking about
> extracting the test names during trunk builds using Gradle and storing that
> somewhere. I think it's fair to say we can derive this data from Git and
> Develocity. We can probably figure out the implementation details later on
> :)
>
> I've updated the KIP to describe the two-tiered approach as well as
> including a diagram.
>
> -David
>
>
> On Thu, Sep 19, 2024 at 12:40 PM Chia-Ping Tsai  wrote:
>
> > > However, this doesn't help with the newly added tests that introduce
> > flakiness. For this, we need some way
> > to detect when a new test has been added. Still thinking through this...
> >
> > Maybe we can use git + gradle develocity to address it.
> >
> > 1) list the files changed recently (for example: git diff --name-only "@{7
> > days ago}")
> > 2) use the test-related filename to query Gradle Develocity with "7 days
> > ago"
> > 3) the test cases having fewer builds are viewed as "new tests"
> >
> > WDYT?
> >
> > David Arthur  於 2024年9月19日 週四 下午11:56寫道:
> >
> > > TengYao,
> > >
> > > > These two mechanisms are independent. We could manually remove a tag
> > from
> > > a
> > > test, but at the same time, it might still be quarantined.
> > > I know the above situation might sound weird, but I just want to
> > understand
> > > how it would work.
> > >
> > > If we remove a tag from a test, we are signaling that we *think* it is
> > now
> > > stable. However, until we have data
> > > to prove that, we can run the test in the quarantine. This helps with the
> > > situation where there are multiple
> > > sources of flakiness and we are making iterative progress on improving
> > it.
> > >
> > > The alternative is to keep the fix in a PR until the developer gets
> > enough
> > > signal from the PR builds to be
> > > convinced that the flakiness is gone. I mention in the KIP that this
> > > approach has issues.
> > >
> > >
> > > Chia-Ping,
> > >
> > > > If so, could we query the Gradle develocity to get flaky and then make
> > > only those flaky
> > > retryable in CI?
> > >
> > > This can work for existing flaky tests. We can query Develocity to find
> > > flaky tests in the last N days
> > > and run only those tests with retry. This is what I meant by "data
> > driven"
> > > for the quarantine. So, no need to track
> > > "quarantined.txt" file separately -- we can derive this data at build
> > time
> > > from Develocity.
> > >
> > > However, this doesn't help with the newly added tests that introduce
> > > flakiness. For this, we need some way
> > > to detect when a new test has been added. Still thinking through this...
> > >
> > > -David
> > >
> > > On Thu, Sep 19, 2024 at 11:33 AM Chia-Ping Tsai 
> > > wrote:
> > >
> > > > hi David
> > > >
> > > > The two-tiered approach is interesting and I have questions similar to
> > > > TengYao.
> > > >
> > > > BUT, go back to the usage of quarantine and isolation. It seems to me
> > > they
> > > > are used to make our CI not be noised by the flaky, right? If so, could
> > > we
> > > > query the Gradle develocity to get flaky and then make only those flaky
> > > > retryable in CI? That means a non-flaky (stable) test must pass without
> > > > retry.
> > > >
> > > > This approach is more simple (I guess), and it can isolate the flaky
> > (by
> > > > retry) from CI. If a flaky can't pass after retry, I feel it is a
> > > critical
> > > > failed test rather than just a flaky. On top of that, this approach
> > does
> > > > not need an extra file ("quarantined.txt") in git control and it does
> > not
> > > > need to change any code of test classes.
> > > >
> > > > Best,
> > > > Chia-Ping
> > > >
> > > > TengYao Chi  於 2024年9月19日 週四 下午11:19寫道:
> > > >
> > > > > Hi David,
> > > > >
> > > > > Thanks for the explanation.
> > > > >
> > > > > I like this two-tiered approach, which gives us more flexibility to
> > > > handle
> > > > > flaky tests.
> > > > >
> > > > > The following is my understanding of how it works; please correct me
> > if
> > > > I'm
> > > > > wrong:
> > > > > If we adopt the two-tiered approach, the test might have two
> > > > > states.(Isolated by developer, Quarantined automatically).
> > > > > These two mechanisms are independent. We could manually remove a tag
> > > > from a
> > > > > test, but at the same time, it might still be quarantined.
> > > > > I know the above situation might sound weird, but I just want to
> > > > understand
> > > > > how it would work.
> > > > >
> > > > > Best Regards,
> > > > > TengYao
> > > > >
> > >

[jira] [Created] (KAFKA-17583) kafka-config script cannot set `cleanup.policy=delete,compact`

2024-09-20 Thread Luke Chen (Jira)
Luke Chen created KAFKA-17583:
-

 Summary: kafka-config script cannot set 
`cleanup.policy=delete,compact`
 Key: KAFKA-17583
 URL: https://issues.apache.org/jira/browse/KAFKA-17583
 Project: Kafka
  Issue Type: Bug
Affects Versions: 3.8.0, 3.5.0
Reporter: Luke Chen


Kafka-config.sh cannot set configs with "," in the value, because we split the 
option value with "," 
[here|https://github.com/apache/kafka/blob/3783385dc1cc27246cf09ec791e4b43f577a26ea/core/src/main/scala/kafka/admin/ConfigCommand.scala#L300].

{code:java}
> bin/kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics 
> --entity-name t1 --alter --add-config cleanup.policy=delete,compact

requirement failed: Invalid entity config: all configs to be added must be in 
the format "key=val".
{code}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-17586) AsyncKafkaConsumer#seek should NOT wait the completion of backgound

2024-09-20 Thread Chia-Ping Tsai (Jira)
Chia-Ping Tsai created KAFKA-17586:
--

 Summary: AsyncKafkaConsumer#seek should NOT wait the completion of 
backgound
 Key: KAFKA-17586
 URL: https://issues.apache.org/jira/browse/KAFKA-17586
 Project: Kafka
  Issue Type: Bug
Reporter: Chia-Ping Tsai
Assignee: Chia-Ping Tsai


see https://github.com/apache/kafka/pull/17230/files#r1768666772



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-17307) Remove junit-platform.properties from test JARs

2024-09-20 Thread David Arthur (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-17307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Arthur resolved KAFKA-17307.
--
Resolution: Duplicate

> Remove junit-platform.properties from test JARs
> ---
>
> Key: KAFKA-17307
> URL: https://issues.apache.org/jira/browse/KAFKA-17307
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 3.7.1
>Reporter: Cosimo Damiano Prete
>Priority: Critical
> Fix For: 4.0.0
>
>
> Hello.
> When using Kafka in tests, some {{test(-test)}} JARs are pulled which contain 
> a {{junit-platform.properties}} that clashes with the one I've already on the 
> classpath.
> I've built a minimal reproducible example in 
> [https://github.com/cdprete/kafka-junit-platform-props].
> Could it be possible to remove these {{test(-test)}} JARs or, at least, 
> the\{{ junit-platform.properties}} files?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-1090 Flaky Tests 👻

2024-09-20 Thread Chia-Ping Tsai
> Can modules opt out of this feature? For example, the raft module
doesn't have any integration tests and all of the tests are meant to
be deterministic. It would be dangerous to the protocol's correctness
and the consistency of the cluster metadata to allow contributors to
mark tests as flaky in the raft module instead of investigating the
source of the failure.

IMHO, the rule should be "unit test should always be deterministic
(non-retryable)"

Best,
Chia-Ping

José Armando García Sancio  於 2024年9月21日 週六
上午9:01寫道:

> Thanks for the proposal David.
>
> Can modules opt out of this feature? For example, the raft module
> doesn't have any integration tests and all of the tests are meant to
> be deterministic. It would be dangerous to the protocol's correctness
> and the consistency of the cluster metadata to allow contributors to
> mark tests as flaky in the raft module instead of investigating the
> source of the failure.
>
> On Fri, Sep 20, 2024 at 12:07 PM David Arthur  wrote:
> >
> > Chia-Ping, I think something like that can work. I was also thinking
> about
> > extracting the test names during trunk builds using Gradle and storing
> that
> > somewhere. I think it's fair to say we can derive this data from Git and
> > Develocity. We can probably figure out the implementation details later
> on
> > :)
> >
> > I've updated the KIP to describe the two-tiered approach as well as
> > including a diagram.
> >
> > -David
> >
> >
> > On Thu, Sep 19, 2024 at 12:40 PM Chia-Ping Tsai 
> wrote:
> >
> > > > However, this doesn't help with the newly added tests that introduce
> > > flakiness. For this, we need some way
> > > to detect when a new test has been added. Still thinking through
> this...
> > >
> > > Maybe we can use git + gradle develocity to address it.
> > >
> > > 1) list the files changed recently (for example: git diff --name-only
> "@{7
> > > days ago}")
> > > 2) use the test-related filename to query Gradle Develocity with "7
> days
> > > ago"
> > > 3) the test cases having fewer builds are viewed as "new tests"
> > >
> > > WDYT?
> > >
> > > David Arthur  於 2024年9月19日 週四 下午11:56寫道:
> > >
> > > > TengYao,
> > > >
> > > > > These two mechanisms are independent. We could manually remove a
> tag
> > > from
> > > > a
> > > > test, but at the same time, it might still be quarantined.
> > > > I know the above situation might sound weird, but I just want to
> > > understand
> > > > how it would work.
> > > >
> > > > If we remove a tag from a test, we are signaling that we *think* it
> is
> > > now
> > > > stable. However, until we have data
> > > > to prove that, we can run the test in the quarantine. This helps
> with the
> > > > situation where there are multiple
> > > > sources of flakiness and we are making iterative progress on
> improving
> > > it.
> > > >
> > > > The alternative is to keep the fix in a PR until the developer gets
> > > enough
> > > > signal from the PR builds to be
> > > > convinced that the flakiness is gone. I mention in the KIP that this
> > > > approach has issues.
> > > >
> > > >
> > > > Chia-Ping,
> > > >
> > > > > If so, could we query the Gradle develocity to get flaky and then
> make
> > > > only those flaky
> > > > retryable in CI?
> > > >
> > > > This can work for existing flaky tests. We can query Develocity to
> find
> > > > flaky tests in the last N days
> > > > and run only those tests with retry. This is what I meant by "data
> > > driven"
> > > > for the quarantine. So, no need to track
> > > > "quarantined.txt" file separately -- we can derive this data at build
> > > time
> > > > from Develocity.
> > > >
> > > > However, this doesn't help with the newly added tests that introduce
> > > > flakiness. For this, we need some way
> > > > to detect when a new test has been added. Still thinking through
> this...
> > > >
> > > > -David
> > > >
> > > > On Thu, Sep 19, 2024 at 11:33 AM Chia-Ping Tsai 
> > > > wrote:
> > > >
> > > > > hi David
> > > > >
> > > > > The two-tiered approach is interesting and I have questions
> similar to
> > > > > TengYao.
> > > > >
> > > > > BUT, go back to the usage of quarantine and isolation. It seems to
> me
> > > > they
> > > > > are used to make our CI not be noised by the flaky, right? If so,
> could
> > > > we
> > > > > query the Gradle develocity to get flaky and then make only those
> flaky
> > > > > retryable in CI? That means a non-flaky (stable) test must pass
> without
> > > > > retry.
> > > > >
> > > > > This approach is more simple (I guess), and it can isolate the
> flaky
> > > (by
> > > > > retry) from CI. If a flaky can't pass after retry, I feel it is a
> > > > critical
> > > > > failed test rather than just a flaky. On top of that, this approach
> > > does
> > > > > not need an extra file ("quarantined.txt") in git control and it
> does
> > > not
> > > > > need to change any code of test classes.
> > > > >
> > > > > Best,
> > > > > Chia-Ping
> > > > >
> > > > > TengYao Chi  於 2024年9月19日 週四 下午11:19寫道:
> > > > >
> > > > > >

Re: [DISCUSS] KIP-1090 Flaky Tests 👻

2024-09-20 Thread David Arthur
José,

By default, unit tests will not be eligible for the automatic new test
quarantine. Like you said, if a unit test fails, it indicates a problem.

Perhaps we could auto-fail a test marked as "flaky" that is not also tagged
as "integration"?



Chia-Ping,

>IMHO, the rule should be "unit test should always be deterministic
(non-retryable)"

I agree :)

On Fri, Sep 20, 2024 at 9:17 PM Chia-Ping Tsai  wrote:

> > Can modules opt out of this feature? For example, the raft module
> doesn't have any integration tests and all of the tests are meant to
> be deterministic. It would be dangerous to the protocol's correctness
> and the consistency of the cluster metadata to allow contributors to
> mark tests as flaky in the raft module instead of investigating the
> source of the failure.
>
> IMHO, the rule should be "unit test should always be deterministic
> (non-retryable)"
>
> Best,
> Chia-Ping
>
> José Armando García Sancio  於 2024年9月21日 週六
> 上午9:01寫道:
>
> > Thanks for the proposal David.
> >
> > Can modules opt out of this feature? For example, the raft module
> > doesn't have any integration tests and all of the tests are meant to
> > be deterministic. It would be dangerous to the protocol's correctness
> > and the consistency of the cluster metadata to allow contributors to
> > mark tests as flaky in the raft module instead of investigating the
> > source of the failure.
> >
> > On Fri, Sep 20, 2024 at 12:07 PM David Arthur  wrote:
> > >
> > > Chia-Ping, I think something like that can work. I was also thinking
> > about
> > > extracting the test names during trunk builds using Gradle and storing
> > that
> > > somewhere. I think it's fair to say we can derive this data from Git
> and
> > > Develocity. We can probably figure out the implementation details later
> > on
> > > :)
> > >
> > > I've updated the KIP to describe the two-tiered approach as well as
> > > including a diagram.
> > >
> > > -David
> > >
> > >
> > > On Thu, Sep 19, 2024 at 12:40 PM Chia-Ping Tsai 
> > wrote:
> > >
> > > > > However, this doesn't help with the newly added tests that
> introduce
> > > > flakiness. For this, we need some way
> > > > to detect when a new test has been added. Still thinking through
> > this...
> > > >
> > > > Maybe we can use git + gradle develocity to address it.
> > > >
> > > > 1) list the files changed recently (for example: git diff --name-only
> > "@{7
> > > > days ago}")
> > > > 2) use the test-related filename to query Gradle Develocity with "7
> > days
> > > > ago"
> > > > 3) the test cases having fewer builds are viewed as "new tests"
> > > >
> > > > WDYT?
> > > >
> > > > David Arthur  於 2024年9月19日 週四 下午11:56寫道:
> > > >
> > > > > TengYao,
> > > > >
> > > > > > These two mechanisms are independent. We could manually remove a
> > tag
> > > > from
> > > > > a
> > > > > test, but at the same time, it might still be quarantined.
> > > > > I know the above situation might sound weird, but I just want to
> > > > understand
> > > > > how it would work.
> > > > >
> > > > > If we remove a tag from a test, we are signaling that we *think* it
> > is
> > > > now
> > > > > stable. However, until we have data
> > > > > to prove that, we can run the test in the quarantine. This helps
> > with the
> > > > > situation where there are multiple
> > > > > sources of flakiness and we are making iterative progress on
> > improving
> > > > it.
> > > > >
> > > > > The alternative is to keep the fix in a PR until the developer gets
> > > > enough
> > > > > signal from the PR builds to be
> > > > > convinced that the flakiness is gone. I mention in the KIP that
> this
> > > > > approach has issues.
> > > > >
> > > > >
> > > > > Chia-Ping,
> > > > >
> > > > > > If so, could we query the Gradle develocity to get flaky and then
> > make
> > > > > only those flaky
> > > > > retryable in CI?
> > > > >
> > > > > This can work for existing flaky tests. We can query Develocity to
> > find
> > > > > flaky tests in the last N days
> > > > > and run only those tests with retry. This is what I meant by "data
> > > > driven"
> > > > > for the quarantine. So, no need to track
> > > > > "quarantined.txt" file separately -- we can derive this data at
> build
> > > > time
> > > > > from Develocity.
> > > > >
> > > > > However, this doesn't help with the newly added tests that
> introduce
> > > > > flakiness. For this, we need some way
> > > > > to detect when a new test has been added. Still thinking through
> > this...
> > > > >
> > > > > -David
> > > > >
> > > > > On Thu, Sep 19, 2024 at 11:33 AM Chia-Ping Tsai <
> chia7...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > hi David
> > > > > >
> > > > > > The two-tiered approach is interesting and I have questions
> > similar to
> > > > > > TengYao.
> > > > > >
> > > > > > BUT, go back to the usage of quarantine and isolation. It seems
> to
> > me
> > > > > they
> > > > > > are used to make our CI not be noised by the flaky, right? If so,
> > could
> > > > > we
> > > > > > query the Gradle