[DISCUSS] KIP-1098: Reverse Checkpointing

2024-10-18 Thread Dániel Urbán
Hi everyone,

I'd like to start the discussion on KIP-1098: Reverse Checkpointing (
https://cwiki.apache.org/confluence/display/KAFKA/KIP-1098%3A+Reverse+Checkpointing)
which aims to minimize message reprocessing for consumers in failbacks.

TIA,
Daniel


Re: [DISCUSS] KIP-1098: Reverse Checkpointing

2024-10-18 Thread Viktor Somogyi-Vass
Hey Dan,

I think this is a very useful idea. Two questions:

SVV1: Do you think we need the feature flag at all? I know that not having
this flag may technically render the KIP unnecessary (though it may still
be useful to discuss the topic and build consensus). As you wrote in the
KIP, we may be able to look up the target and source topics, and if we can
do that, we can probably detect whether the replication is one-way or
prefixless (identity). That means we don't need this flag to control when
we use the feature; it would really just be there to turn the feature on
and off if needed, and I'm not sure there is much risk in simply enabling
it by default. If we really just turn back the B -> A checkpoints and save
them in A -> B, then maybe it's not too risky, and users would get this
immediately just by upgrading.

SVV2: You write that we need DefaultReplicationPolicy to use this feature,
but most of the functionality is already there at the interface level in
ReplicationPolicy. Is anything missing from the interface, and if so, what
do you think about pulling it in? If this improvement only works with the
default replication policy, it's somewhat limiting, as users may have their
own policy for various reasons; if we can make it work at the interface
level, we could provide the feature to everyone. Of course, there can be
replication policies, like the identity one, that by design disallow this
feature, but for that, see my previous point.
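To make SVV2 concrete, here is a minimal stand-alone sketch of the topic-naming logic the feature relies on. It mimics the DefaultReplicationPolicy "cluster.topic" convention; the method names mirror (but this is not) the real org.apache.kafka.connect.mirror.ReplicationPolicy interface, and the class and separator are illustrative assumptions:

```java
// Stand-alone sketch of the DefaultReplicationPolicy-style naming convention
// ("<sourceClusterAlias><separator><topic>"). Not the real Kafka interface;
// method names deliberately mirror ReplicationPolicy for illustration only.
public class ReplicationPolicySketch {
    static final String SEPARATOR = ".";

    // Mirrors ReplicationPolicy#formatRemoteTopic: name of a replicated topic
    static String formatRemoteTopic(String sourceClusterAlias, String topic) {
        return sourceClusterAlias + SEPARATOR + topic;
    }

    // Mirrors ReplicationPolicy#topicSource: which cluster a remote topic came from
    static String topicSource(String topic) {
        int i = topic.indexOf(SEPARATOR);
        return i < 0 ? null : topic.substring(0, i);
    }

    // Mirrors ReplicationPolicy#upstreamTopic: the topic's name on the source cluster
    static String upstreamTopic(String topic) {
        int i = topic.indexOf(SEPARATOR);
        return i < 0 ? null : topic.substring(i + SEPARATOR.length());
    }

    public static void main(String[] args) {
        // A topic replicated A -> B is named "A.orders" on cluster B. A
        // prefixless (identity) policy would yield no source cluster here,
        // which is one way to detect that reverse checkpointing cannot apply.
        System.out.println(topicSource(formatRemoteTopic("A", "orders")));
        System.out.println(upstreamTopic("A.orders"));
    }
}
```

If lookups like these are expressible purely through the interface, the feature would not need to be tied to DefaultReplicationPolicy.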

Best,
Viktor

On Fri, Oct 18, 2024 at 3:30 PM Dániel Urbán  wrote:

> Hi everyone,
>
> I'd like to start the discussion on KIP-1098: Reverse Checkpointing (
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1098%3A+Reverse+Checkpointing
> )
> which aims to minimize message reprocessing for consumers in failbacks.
>
> TIA,
> Daniel
>


Re: [VOTE] KIP-1073: Return fenced brokers in DescribeCluster response

2024-10-18 Thread Gantigmaa Selenge
Hi,

I have raised a draft PR implementing this KIP here:
https://github.com/apache/kafka/pull/17524, in case anyone is interested in
taking a look.

Regards,
Tina

On Mon, Oct 7, 2024 at 12:53 PM Gantigmaa Selenge 
wrote:

> Thank you for voting Federico and Luke!
>
> Is there anyone else who would like to vote on this KIP please? It still
> needs 2 more binding +1s :)
>
> Regards,
> Tina
>
> On Wed, Sep 25, 2024 at 8:40 AM Federico Valeri 
> wrote:
>
>> +1 non binding
>>
>> Thanks Tina
>>
>> On Wed, Sep 25, 2024 at 5:29 AM Luke Chen  wrote:
>> >
>> > Hi Tina,
>> >
>> > Thanks for the KIP.
>> > +1 (binding) from me.
>> >
>> > Luke
>> >
>> >
>> > On Mon, Sep 23, 2024 at 9:03 PM Gantigmaa Selenge 
>> > wrote:
>> >
>> > > Hi everyone,
>> > >
>> > > I would like to start voting for KIP-1073 that extends
>> DescribeCluster API
>> > > to optionally return fenced brokers in the response.
>> > >
>> > >
>> > >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1073%3A+Return+fenced+brokers+in+DescribeCluster+response
>> > >
>> > > Thanks.
>> > > Gantigmaa
>> > >
>>
>>


[jira] [Resolved] (KAFKA-17447) Changed fetch queue processing to reduce the no. of locking and unlocking activity

2024-10-18 Thread Apoorv Mittal (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-17447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apoorv Mittal resolved KAFKA-17447.
---
Resolution: Fixed

> Changed fetch queue processing to reduce the no. of locking and unlocking 
> activity
> --
>
> Key: KAFKA-17447
> URL: https://issues.apache.org/jira/browse/KAFKA-17447
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Abhinav Dixit
>Assignee: Abhinav Dixit
>Priority: Major
>
> For share group fetch request processing, we have a recursive approach to 
> dealing with individual fetch requests. While it works fine with fewer 
> records (< 1,000,000) and less sharing (< 5 share consumers), some requests 
> seem to get stuck when we increase the load and try to increase the 
> throughput. I've replaced this approach by removing the unlocking and 
> locking of the fetch queue between entries. This reduces the complexity and 
> also removes the reliability issue under increased load.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-17654) Fix flaky ProducerIdManagerTest#testUnrecoverableErrors

2024-10-18 Thread Chia-Ping Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-17654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chia-Ping Tsai resolved KAFKA-17654.

Fix Version/s: 4.0.0
   Resolution: Fixed

> Fix flaky ProducerIdManagerTest#testUnrecoverableErrors
> ---
>
> Key: KAFKA-17654
> URL: https://issues.apache.org/jira/browse/KAFKA-17654
> Project: Kafka
>  Issue Type: Test
>Reporter: Chia-Ping Tsai
>Assignee: 黃竣陽
>Priority: Major
> Fix For: 4.0.0
>
>
> https://github.com/apache/kafka/actions/runs/11079975812/attempts/2#summary-30806776266





[jira] [Created] (KAFKA-17835) Move ProducerIdManager and RPCProducerIdManager to server module

2024-10-18 Thread Chia-Ping Tsai (Jira)
Chia-Ping Tsai created KAFKA-17835:
--

 Summary: Move ProducerIdManager and RPCProducerIdManager to server 
module
 Key: KAFKA-17835
 URL: https://issues.apache.org/jira/browse/KAFKA-17835
 Project: Kafka
  Issue Type: Sub-task
Reporter: Chia-Ping Tsai
Assignee: 黃竣陽


as title. we don't need to touch zk impl BTW





[jira] [Created] (KAFKA-17833) Convert DescribeAuthorizedOperationsTest to use kraft

2024-10-18 Thread PoAn Yang (Jira)
PoAn Yang created KAFKA-17833:
-

 Summary: Convert DescribeAuthorizedOperationsTest to use kraft
 Key: KAFKA-17833
 URL: https://issues.apache.org/jira/browse/KAFKA-17833
 Project: Kafka
  Issue Type: Bug
Reporter: PoAn Yang
Assignee: PoAn Yang


ref: https://github.com/apache/kafka/pull/17424#discussion_r1804513825





Re: [VOTE] KIP-1092: Extend Consumer#close with an option to leave the group or not

2024-10-18 Thread TengYao Chi
Hello everyone,

The vote is now closed, and the KIP has been accepted with 3 binding +1s
from Chia-Ping, Sophie, Matthias, as well as 3 non-binding +1 from Andrew,
Kirk, Apoorv.

Thank you all for your participation!

Sincerely,
TengYao

Matthias J. Sax  於 2024年10月19日 週六 上午1:54寫道:

> +1 (binding)
>
>
>
> On 10/11/24 3:15 AM, Apoorv Mittal wrote:
> > Hi TengYao,
> > Thanks for the KIP.
> >
> > +1 (non-binding)
> >
> > Regards,
> > Apoorv Mittal
> >
> >
> > On Fri, Oct 11, 2024 at 12:55 AM Sophie Blee-Goldman <
> sop...@responsive.dev>
> > wrote:
> >
> >> +1 (binding)
> >>
> >> Thanks for this KIP!
> >>
> >> On Mon, Oct 7, 2024 at 9:22 AM Kirk True  wrote:
> >>
> >>> Hi TengYao,
> >>>
> >>> +1 (non-binding)
> >>>
> >>> Thanks for all the work so far on this.
> >>>
> >>> Kirk
> >>>
> >>> On Mon, Oct 7, 2024, at 4:09 AM, TengYao Chi wrote:
>  Hi Andrew,
> 
>  Thanks for reviewing and participating in the vote.
>  I have corrected the issue as you pointed out.
> 
>  Sincerely,
>  TengYao
> 
>  Andrew Schofield  於 2024年10月7日 週一
>  下午6:44寫道:
> 
> > Thanks for the KIP.
> >
> > +1 (non-binding)
> >
> > I have one tiny nit that in the text but not the code snippet, you
> >>> mention
> > methods called
> > CloseOption.withTimeoutFluent(Duration timeout) and
> >
> >> CloseOption.withGroupMembershipOperationFluent(GroupMembershipOperation
> > operation).
> > I expect you meant to remove "Fluent" in all cases so the text
> >> matches
> >>> the
> > code snippet.
> >
> > Andrew
> >
> > 
> > From: TengYao Chi 
> > Sent: 07 October 2024 08:44
> > To: dev@kafka.apache.org 
> > Subject: Re: [VOTE] KIP-1092: Extend Consumer#close with an option to
> > leave the group or not
> >
> > Hi Chia-Ping,
> >
> > Thanks for pointing that out.
> > I originally wrote the full version to show the equivalent semantic
> > example, but you're absolutely right—it can be simplified by omitting
> > `.withGroupMembershipOperation(GroupMembershipOperation.DEFAULT)`
> >> since
> > it's the default value.
> >
> > This actually gave me an idea to include the shorter version for
> >>> clarity in
> > the explanation.
> > I will update it accordingly.
> >
> > Sincerely,
> > TengYao
> >
> > Chia-Ping Tsai  於 2024年10月7日 週一 下午1:13寫道:
> >
> >> +1 (binding)
> >>
> >> nit: in the "Migration Plan"
> >>
> >> ```
> >> consumer.close(CloseOption.timeout(Duration.ofSeconds(30))
> >>
> >>   .withGroupMembershipOperation(GroupMembershipOperation.DEFAULT));
> >> ```
> >>
> >> The sample above can likely be simplified, right?
> >>
> >> ```
> >> consumer.close(CloseOption.timeout(Duration.ofSeconds(30)));
> >> ```
> >>
> >> Best,
> >> Chia-Ping
> >>
> >> TengYao Chi  於 2024年10月7日 週一 上午10:23寫道:
> >>
> >>> Hi everyone,
> >>>
> >>> Based on our discussion
> >>> <
> >> https://lists.apache.org/thread/023mo7lk1vfvljjoovwbzwmw9wvf5t6m>
> >>>   regarding KIP-1092 <
> >> https://cwiki.apache.org/confluence/x/JQstEw>,
> >>> I
> >>> believe this KIP is now ready for a vote.
> >>>
> >>> Sincerely,
> >>> TengYao
> >>>
> >>
> >
> 
> >>>
> >>
> >
>


Re: [DISCUSS] KIP-1050: Consistent error handling for Transactions

2024-10-18 Thread Artem Livshits
Hi Kaushik,

Thank you for the KIP!  I think it'll make writing transactional
applications easier and less error-prone.  I have a couple of comments:

AL1.  The KIP proposes changing the semantics
of UnknownProducerIdException.  Currently, this error is never returned by
the broker, so we cannot validate whether the changed semantics would be
compatible with a future use (if we ever return it again), and a future
broker wouldn't know which semantics the client supports.  I think for now
we could just add a comment in the code that this error is never returned
by the broker, and if we ever add it back we'd need to make sure it's
categorized properly and bump the API version so that the broker knows
whether the client supports the required semantics.

AL2. The exception classification looks good to me from the producer
perspective: we have retriable, abortable, etc. categories.  The retriable
errors can be retried within producer safely without losing idempotence,
the messages would be retried with the same sequence numbers and we won't
have duplicates.  However, if a retriable error (e.g. TimeoutException) is
thrown to the application, it's not safe to retry the produce, because the
messages will get new sequence numbers and if the original message in fact
succeeded, the new messages would become duplicates, losing the exactly
once semantics.  Thus, if a retriable exception is thrown from a
transactional "produce" we should abort the transaction to guarantee that
we won't have duplicates when we produce messages again.  To avoid
confusion, I propose to change the transactional "produce" to never return
a retriable error, i.e. any retriable error would be translated
into TransactionAbortableException (the error message could be copied from
the original error) before getting thrown from transactional "produce".
This would avoid bugs in the application -- even if the application
developer gets confused and decides to retry on TimeoutException, the buggy
code would never run.

Another change we should make is to ensure that .abortTransaction() itself
never throws a producer-abortable exception.

The handling of retriable errors for other APIs can stay unchanged --
.commitTransaction() and .abortTransaction() are idempotent and can be
safely retried, read-only APIs can always be safely retried, and for
non-transactional producer we don't have a good "path back to normal"
anyway -- we cannot reliably abort messages with unknown fate.
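The translation rule in AL2 can be sketched as follows. This is a stand-alone illustration of the proposed behavior, not the actual producer code: RetriableStandIn and AbortableStandIn are placeholder classes standing in for Kafka's RetriableException and TransactionAbortableException.

```java
// Sketch of AL2: before a transactional send() surfaces an error to the
// application, translate any retriable error into an abortable one.
// Retrying the produce from the application would assign new sequence
// numbers and could create duplicates, so aborting the transaction is the
// only safe recovery. The exception classes here are illustrative stand-ins,
// not the real org.apache.kafka.common.errors types.
public class TxnErrorTranslation {
    static class RetriableStandIn extends RuntimeException {
        RetriableStandIn(String msg) { super(msg); }
    }

    static class AbortableStandIn extends RuntimeException {
        AbortableStandIn(String msg) { super(msg); }
    }

    // Any retriable error escaping a transactional produce becomes abortable,
    // carrying the original message; all other errors pass through unchanged.
    static RuntimeException translateForTransactionalSend(RuntimeException e) {
        if (e instanceof RetriableStandIn) {
            return new AbortableStandIn(e.getMessage());
        }
        return e;
    }

    public static void main(String[] args) {
        RuntimeException out =
            translateForTransactionalSend(new RetriableStandIn("request timed out"));
        // The application now sees only an abortable error, so even buggy
        // "retry on timeout" handling code would never run.
        System.out.println(out.getClass().getSimpleName());
    }
}
```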

-Artem

On Fri, Oct 4, 2024 at 5:11 AM Lianet M.  wrote:

> Hello, thanks for the KIP! After going through the KIP and discussion here
> are some initial comments.
>
> 107 - I understand we’re  proposing a new
> ProducerRetriableTransactionException, and changing existing exceptions to
> inherit from it (the ones on the table below it).  The existing exceptions
> inherit from RetriableException today, but with this KIP, they would
> inherit from ProducerRetriableTransactionException which is not a
> RetriableException ("*ProducerRetriableTransactionException extends
> ApiException"*). Is my understanding correct? Wouldn’t this break
> applications that could be handling/expecting RetriableExceptions today?
> (Ie. apps dealing with TimeoutException on send , if they have
> catch(RetriableException) or checks in the form of instanceOf
> RetriableException, would need to change to the new
>  ProducerRetriableTransactionException or the specific TimeoutException,
> right?). I get this wouldn’t bring a problem for most of the retriable
> exceptions on the table given that they end up being handled/retried
> internally, but TimeoutException is tricky.
>
>
> 108 -  Regarding how we limit the scope of the change to the
> producer/transactional API. TimeoutException is not only used in the
> transactional API, but also in the consumer API, propagated to the user in
> multiple api calls. Not clear to me how with this proposal we wouldn’t end
> up with a consumer throwing a TimeoutException instanceOf
> ProducerRetriableTransactionException? (Instead of instanceOf
> RetriableException like it is today)? Again, potentially breaking apps but
> also with a conceptually wrong consumer error?
>
>
> 109 - Similar to above, for exceptions like
> UnknownTopicOrPartitionException, which are today instanceOf
> RetriableException, if we’re saying they will be subclass of
> ProducerRefreshRetriableTransactionException (ApiException) that will
> affect the consumer logic too, where we do handle RetriableExceptions like
> the unknownTopic, expecting RetriableException. This is all internal logic
> and could be updated as needed of course, but without leaking
> producer-specific groupings into the consumer I would expect.
>
>
> 110 - The KIP refers to the existing TransactionAbortableException (from
> KIP-890), but on the public changes it refers to class
> AbortableTransactionException extends ApiException. So are we proposing a
> new exception type for this or reusing the existing one?
>
> 111 - I notice th

[jira] [Resolved] (KAFKA-12894) KIP-750: Drop support for Java 8 in Kafka 4.0 (deprecate in 3.0)

2024-10-18 Thread Chia-Ping Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chia-Ping Tsai resolved KAFKA-12894.

Resolution: Fixed

all sub tasks are completed. resolve this now :)

> KIP-750: Drop support for Java 8 in Kafka 4.0 (deprecate in 3.0)
> 
>
> Key: KAFKA-12894
> URL: https://issues.apache.org/jira/browse/KAFKA-12894
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Ismael Juma
>Assignee: Ismael Juma
>Priority: Major
>  Labels: kip
> Fix For: 4.0.0
>
>
> We propose deprecating Java 8 support in Apache Kafka 3.0 and dropping 
> support in Apache Kafka 4.0.
> KIP: 
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=181308223





[jira] [Resolved] (KAFKA-17832) Remove all EnabledForJreRange

2024-10-18 Thread Chia-Ping Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-17832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chia-Ping Tsai resolved KAFKA-17832.

Fix Version/s: 4.0.0
   Resolution: Fixed

> Remove all EnabledForJreRange
> -
>
> Key: KAFKA-17832
> URL: https://issues.apache.org/jira/browse/KAFKA-17832
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Chia-Ping Tsai
>Assignee: Jhen Yung Hsu
>Priority: Major
> Fix For: 4.0.0
>
>
> the min=JDK8 is not required now





Re: [DISCUSS] Apache Kafka 4.0.0 release

2024-10-18 Thread TengYao Chi
Hi David,

I would like to request the inclusion of KIP-1092 in the 4.0 release plan.
This KIP is newly accepted, and I believe it will bring valuable
enhancements for both the Consumer and Kafka Streams components.

Thank you for considering this addition.

Best,
TengYao

Colin McCabe  於 2024年10月11日 週五 上午12:22寫道:

> Yes (although I don't believe there's a JIRA for that specifically yet.)
>
> best,
> Colin
>
>
> On Thu, Oct 10, 2024, at 09:11, David Jacot wrote:
> > Hi Colin,
> >
> > Thanks. That seems perfect. Does it include removing --zookeeper from the
> > command line tools (e.g. kafka-configs)?
> >
> > Best,
> > David
> >
> > On Thu, Oct 10, 2024 at 6:06 PM Colin McCabe  wrote:
> >
> >> Hi David & Luke,
> >>
> >> We have been using https://issues.apache.org/jira/browse/KAFKA-17611 as
> >> the umbrella JIRA for ZK removal tasks.  Progress has been pretty
> rapid, I
> >> do think we will get there by January. (Well hopefully even earlier :)
> >>
> >> best,
> >> Colin
> >>
> >>
> >> On Thu, Oct 10, 2024, at 05:55, David Jacot wrote:
> >> > Hi Luke,
> >> >
> >> > That's a good point. I think that we should try to stick to the dates
> >> > though. In my opinion, we should ensure that ZK and all the related
> >> public
> >> > facing apis are gone in 4.0 by code freeze. The simplest way would be
> to
> >> > have an epic for the removal with all the related tasks. We can then
> mark
> >> > it as a blocker for 4.0. We may already have one. Let me search
> around.
> >> >
> >> > Best,
> >> > David
> >> >
> >> > On Thu, Oct 10, 2024 at 2:43 PM Luke Chen  wrote:
> >> >
> >> >> Hi David,
> >> >>
> >> >> The release plan looks good to me.
> >> >>
> >> >> But since the 4.0 release is a release without ZK, I'm wondering if
> we
> >> >> should list some release criteria for it?
> >> >> The Problem I can imagine is like, when the code freeze date is
> reached,
> >> >> but there are still many ZK removal tasks open, what should we do
> with
> >> >> them? We can remove the rest of codes in later releases, but if there
> >> are
> >> >> still some public APIs related to ZK open for users to use, is that
> >> still
> >> >> OK to release 4.0? Ex: The `--zookeeper` option in kafka-configs.sh?
> >> >>
> >> >> Instead of only using time-based management in v4.0, I'm suggesting
> we
> >> >> should list some "must-do" tasks and track them when milestone date
> >> >> approaches. And we can discuss with the community if there are some
> >> delayed
> >> >> tasks. Does that make sense?
> >> >>
> >> >> Thanks.
> >> >> Luke
> >> >>
> >> >>
> >> >>
> >> >> On Mon, Oct 7, 2024 at 9:29 PM David Jacot
>  >> >
> >> >> wrote:
> >> >>
> >> >> > Hi all,
> >> >> >
> >> >> > I have drafted the release plan for Apache Kafka 4.0.0:
> >> >> >
> https://cwiki.apache.org/confluence/display/KAFKA/Release+Plan+4.0.0.
> >> >> > Please take a look and let me know what you think.
> >> >> >
> >> >> > I would also appreciate it if you could review the list of KIPs
> that
> >> we
> >> >> > will ship in this release. I took the list from the approved list
> in
> >> the
> >> >> > wiki but I am pretty sure that the list is not accurate.
> >> >> >
> >> >> > Best,
> >> >> > David
> >> >> >
> >> >> > On Wed, Sep 25, 2024 at 5:02 AM Luke Chen 
> wrote:
> >> >> >
> >> >> > > +1 from me.
> >> >> > > Since v4.0 will be a huge change release, I'm not sure if you
> need
> >> >> > another
> >> >> > > person (vice release manager?) to help on releasing tasks.
> >> >> > > If so, I'm willing to help.
> >> >> > >
> >> >> > > Thanks.
> >> >> > > Luke
> >> >> > >
> >> >> > > On Wed, Sep 25, 2024 at 1:33 AM Colin McCabe  >
> >> >> wrote:
> >> >> > >
> >> >> > > > +1. Thanks, David.
> >> >> > > >
> >> >> > > > best,
> >> >> > > > Colin
> >> >> > > >
> >> >> > > > On Mon, Sep 23, 2024, at 10:11, Chris Egerton wrote:
> >> >> > > > > Thanks David! +1
> >> >> > > > >
> >> >> > > > > On Mon, Sep 23, 2024, 13:07 José Armando García Sancio
> >> >> > > > >  wrote:
> >> >> > > > >
> >> >> > > > >> +1, thanks for volunteering David!
> >> >> > > > >>
> >> >> > > > >> On Mon, Sep 23, 2024 at 11:56 AM David Arthur <
> >> mum...@gmail.com>
> >> >> > > wrote:
> >> >> > > > >> >
> >> >> > > > >> > +1, thanks David!
> >> >> > > > >> >
> >> >> > > > >> > On Mon, Sep 23, 2024 at 11:48 AM Satish Duggana <
> >> >> > > > >> satish.dugg...@gmail.com>
> >> >> > > > >> > wrote:
> >> >> > > > >> >
> >> >> > > > >> > > +1
> >> >> > > > >> > > Thanks David for volunteering!
> >> >> > > > >> > >
> >> >> > > > >> > > On Mon, 23 Sept 2024 at 19:27, Lianet M. <
> >> liane...@gmail.com>
> >> >> > > > wrote:
> >> >> > > > >> > > >
> >> >> > > > >> > > > +1
> >> >> > > > >> > > >
> >> >> > > > >> > > > Thanks David!
> >> >> > > > >> > > >
> >> >> > > > >> > > > On Mon, Sep 23, 2024, 9:54 a.m. Justine Olshan
> >> >> > > > >> > > 
> >> >> > > > >> > > > wrote:
> >> >> > > > >> > > >
> >> >> > > > >> > > > > +1 and thanks!
> >> >> > > > >> > > > >
> >> >> > > > >> > > > > On Mon, Sep 23, 2024 at 6:36 AM Chia-Ping Ts

[jira] [Resolved] (KAFKA-17817) Remove cache from FetchRequest#fetchData

2024-10-18 Thread Chia-Ping Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-17817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chia-Ping Tsai resolved KAFKA-17817.

Fix Version/s: 4.0.0
   Resolution: Fixed

> Remove cache from FetchRequest#fetchData
> 
>
> Key: KAFKA-17817
> URL: https://issues.apache.org/jira/browse/KAFKA-17817
> Project: Kafka
>  Issue Type: Bug
>Reporter: Chia-Ping Tsai
>Assignee: kangning.li
>Priority: Minor
> Fix For: 4.0.0
>
>
> This is similar to KAFKA-16684 and KAFKA-17102
> We don't reuse the request after generating the cache, and the cache is only 
> valid if the input remains identical. This makes the method unreliable...





[jira] [Created] (KAFKA-17834) Improve the e2e Dockerfile

2024-10-18 Thread Jira
黃竣陽 created KAFKA-17834:
---

 Summary: Improve the e2e Dockerfile
 Key: KAFKA-17834
 URL: https://issues.apache.org/jira/browse/KAFKA-17834
 Project: Kafka
  Issue Type: Improvement
Affects Versions: 4.0.0
Reporter: 黃竣陽
Assignee: 黃竣陽


On my local machine, Docker shows the warnings below when I build the 
Dockerfile; we should improve the Dockerfile to avoid them:

 3 warnings found (use docker --debug to expand):
 - MaintainerDeprecated: Maintainer instruction is deprecated in favor of using 
label (line 49)
 - LegacyKeyValueFormat: "ENV key=value" should be used instead of legacy "ENV 
key value" format (line 56)
 - JSONArgsRecommended: JSON arguments recommended for CMD to prevent 
unintended behavior related to OS signals (line 158)
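Each warning maps to a straightforward fix. A sketch with placeholder values (the label, environment variable, and command below are illustrative, not the actual contents of the e2e Dockerfile):

```dockerfile
# MaintainerDeprecated: use a LABEL instead of the deprecated MAINTAINER
LABEL maintainer="dev@kafka.apache.org"

# LegacyKeyValueFormat: use "ENV key=value" rather than "ENV key value"
ENV KAFKA_HOME=/opt/kafka

# JSONArgsRecommended: exec (JSON) form lets the process receive OS signals
# directly instead of being wrapped in a shell
CMD ["/opt/kafka/start.sh"]
```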





Re: [VOTE] KIP-1092: Extend Consumer#close with an option to leave the group or not

2024-10-18 Thread Matthias J. Sax

+1 (binding)



[jira] [Resolved] (KAFKA-17827) cleanup the mockito version

2024-10-18 Thread Chia-Ping Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-17827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chia-Ping Tsai resolved KAFKA-17827.

Fix Version/s: 4.0.0
   Resolution: Fixed

> cleanup the mockito version
> --
>
> Key: KAFKA-17827
> URL: https://issues.apache.org/jira/browse/KAFKA-17827
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Chia-Ping Tsai
>Assignee: 黃竣陽
>Priority: Major
> Fix For: 4.0.0
>
>
> as title. we don't need to check the java version now.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-17832) Remove all EnabledForJreRange

2024-10-18 Thread Chia-Ping Tsai (Jira)
Chia-Ping Tsai created KAFKA-17832:
--

 Summary: Remove all EnabledForJreRange
 Key: KAFKA-17832
 URL: https://issues.apache.org/jira/browse/KAFKA-17832
 Project: Kafka
  Issue Type: Sub-task
Reporter: Chia-Ping Tsai
Assignee: Jhen Yung Hsu


the min=JDK8 is not required now





[jira] [Resolved] (KAFKA-17780) Heartbeat interval is not configured in the heartbeat request manager

2024-10-18 Thread Arpit Goyal (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-17780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arpit Goyal resolved KAFKA-17780.
-
Resolution: Invalid

> Heartbeat interval is not configured in the heartbeat request manager
> 
>
> Key: KAFKA-17780
> URL: https://issues.apache.org/jira/browse/KAFKA-17780
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, consumer, kip
>Reporter: Arpit Goyal
>Priority: Critical
>  Labels: kip-848-client-support
> Fix For: 4.0.0
>
>
> In the AbstractHeartBeatRequestManager, I observed we are not setting the 
> right heartbeat request interval. Is this intentional?
> [~lianetm] [~kirktrue]
> {code:java}
> long retryBackoffMs = config.getLong(ConsumerConfig.RETRY_BACKOFF_MS_CONFIG);
> long retryBackoffMaxMs = 
> config.getLong(ConsumerConfig.RETRY_BACKOFF_MAX_MS_CONFIG);
> this.heartbeatRequestState = new HeartbeatRequestState(logContext, 
> time, 0, retryBackoffMs,
> retryBackoffMaxMs, maxPollIntervalMs);
> {code}





[jira] [Created] (KAFKA-17828) Reverse Checkpointing in MirrorMaker2

2024-10-18 Thread Daniel Urban (Jira)
Daniel Urban created KAFKA-17828:


 Summary: Reverse Checkpointing in MirrorMaker2
 Key: KAFKA-17828
 URL: https://issues.apache.org/jira/browse/KAFKA-17828
 Project: Kafka
  Issue Type: Improvement
  Components: mirrormaker
Reporter: Daniel Urban
Assignee: Daniel Urban


MM2 supports checkpointing - replicating the committed offsets of groups across 
Kafka clusters.

This is a one-way replication, meaning that consumers can rely on this when 
they fail over from the upstream cluster to the downstream cluster. But 
checkpointing would be desirable in failbacks, too: if the consumers processed 
messages from the downstream topic, they do not want to consume the same 
messages again in the upstream topic.

To avoid this, checkpointing should also support reverse checkpointing in the 
context of a bidirectional replication: creating checkpoints between 
downstream->upstream topics.

This ticket implements KIP-1098.
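For context, a bidirectional MM2 setup in which reverse checkpointing would apply might look like this. This is an illustrative mm2.properties sketch; the cluster aliases A and B are placeholders:

```properties
# Two clusters replicating in both directions (aliases are placeholders)
clusters = A, B
A->B.enabled = true
B->A.enabled = true

# Checkpointing replicates committed group offsets across clusters.
# KIP-1098 proposes additionally turning B->A checkpoints back into A->B
# ones, so consumers failing back from B to A skip already-processed messages.
A->B.emit.checkpoints.enabled = true
B->A.emit.checkpoints.enabled = true
```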





[jira] [Resolved] (KAFKA-17099) Improve the process exception logs with the exact processor node name in which processing exceptions occur

2024-10-18 Thread Bruno Cadonna (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Cadonna resolved KAFKA-17099.
---
Resolution: Fixed

> Improve the process exception logs with the exact processor node name in 
> which processing exceptions occur
> --
>
> Key: KAFKA-17099
> URL: https://issues.apache.org/jira/browse/KAFKA-17099
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Loïc Greffier
>Assignee: Loïc Greffier
>Priority: Minor
>
> h2. Current Behaviour
> When an exception occurs in a processor node, the task executor does not log 
> the actual processor node where the exception occurs.
>  
> For example, considering the following topology:
>  
> Topologies:
>    Sub-topology: 0
>     Source: KSTREAM-SOURCE-00 (topics: [PERSON_TOPIC])
>       --> KSTREAM-PEEK-01
>     Processor: KSTREAM-PEEK-01 (stores: [])
>       --> KSTREAM-MAP-02
>       <-- KSTREAM-SOURCE-00
>     Processor: KSTREAM-MAP-02 (stores: [])
>       --> KSTREAM-SINK-03
>       <-- KSTREAM-PEEK-01
>     Sink: KSTREAM-SINK-03 (topic: PERSON_MAP_TOPIC)
>       <-- KSTREAM-MAP-02
>  
> When an exception is thrown in the processor KSTREAM-MAP-02, the 
> following information will be logged by the task executor:
>  
> 2024-07-08T22:17:19.926+02:00  INFO 10552 — [-StreamThread-1] 
> i.g.l.s.map.app.KafkaStreamsTopology     : Received key = 0, value = \{"id": 
> 0, "firstName": "Ethan", "lastName": "Moore", "nationality": "CH", 
> "birthDate": "2011-02-21T15:45:12Z"}
> 2024-07-08T22:17:30.082+02:00 ERROR 10552 — [-StreamThread-1] 
> o.a.k.s.p.internals.TaskExecutor         : stream-thread 
> [streams-map-StreamThread-1] Failed to process stream task 0_0 due to the 
> following error:
>  
> org.apache.kafka.streams.errors.StreamsException: Exception caught in 
> process. taskId=0_0, processor=KSTREAM-SOURCE-00, topic=PERSON_TOPIC, 
> partition=0, offset=0, stacktrace=java.lang.RuntimeException: Something bad 
> happened...
> at io.github.loicgreffier.streams.map.app.KafkaStreamsTopology.lambda$topology$1(KafkaStreamsTopology.java:33)
> at org.apache.kafka.streams.kstream.internals.KStreamMap$KStreamMapProcessor.process(KStreamMap.java:46)
> at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:152)
> at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forwardInternal(ProcessorContextImpl.java:290)
> at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:269)
> at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:228)
> at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:215)
> at org.apache.kafka.streams.kstream.internals.KStreamPeek$KStreamPeekProcessor.process(KStreamPeek.java:42)
> at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:154)
> at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forwardInternal(ProcessorContextImpl.java:290)
> at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:269)
> at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:228)
> at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:84)
> at org.apache.kafka.streams.processor.internals.StreamTask.lambda$doProcess$1(StreamTask.java:810)
> at org.apache.kafka.streams.processor.internals.metrics.StreamsMetricsImpl.maybeMeasureLatency(StreamsMetricsImpl.java:872)
> at org.apache.kafka.streams.processor.internals.StreamTask.doProcess(StreamTask.java:810)
> at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:741)
> at org.apache.kafka.streams.processor.internals.TaskExecutor.processTask(TaskExecutor.java:97)
> at org.apache.kafka.streams.processor.internals.TaskExecutor.process(TaskExecutor.java:78)
> at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:1765)
> at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:807)
> at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:617)
> at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:579)
>  
> at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:767) ~[kafka-streams-3.6.1.jar:na]
> at org.apache.kafka.streams.processor.internals.TaskExecutor.processTask(TaskExecutor.java:97)
> 

[jira] [Created] (KAFKA-17829) Add an integration test verifying ShareFetch requests return a future on purgatory close

2024-10-18 Thread Abhinav Dixit (Jira)
Abhinav Dixit created KAFKA-17829:
-

 Summary: Add an integration test verifying ShareFetch requests 
return a future on purgatory close
 Key: KAFKA-17829
 URL: https://issues.apache.org/jira/browse/KAFKA-17829
 Project: Kafka
  Issue Type: Sub-task
Reporter: Abhinav Dixit


We need to add an integration test verifying that, on shutdown of the 
delayed share fetch purgatory, the share fetch requests still present 
inside the purgatory complete with an erroneous future. 
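The expected behavior can be sketched with a toy model (class and method names here, like `ToyPurgatory` and `watch`, are illustrative stand-ins, not the actual Kafka internals): on close, every still-pending request's future should complete exceptionally rather than hang forever.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

// Simplified stand-in for a delayed-operation purgatory: pending share
// fetch requests are parked here until data arrives or the broker shuts down.
class ToyPurgatory {
    private final List<CompletableFuture<String>> pending = new CopyOnWriteArrayList<>();

    // Park a request and hand its future back to the caller.
    CompletableFuture<String> watch() {
        CompletableFuture<String> f = new CompletableFuture<>();
        pending.add(f);
        return f;
    }

    // On shutdown, fail every still-pending future instead of leaving it hanging.
    void close() {
        for (CompletableFuture<String> f : pending) {
            f.completeExceptionally(new IllegalStateException("purgatory closed"));
        }
        pending.clear();
    }
}

public class PurgatoryCloseSketch {
    // The property the integration test should assert, in miniature.
    public static boolean pendingFutureFailsOnClose() {
        ToyPurgatory purgatory = new ToyPurgatory();
        CompletableFuture<String> f = purgatory.watch();
        purgatory.close();
        return f.isCompletedExceptionally();
    }

    public static void main(String[] args) {
        System.out.println("pending future failed on close: " + pendingFutureFailsOnClose());
    }
}
```

The real test would of course go through the broker and the actual purgatory implementation; this only pins down the assertion shape.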



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-17809) Fix flaky test: testExplicitAcknowledgementCommitAsync

2024-10-18 Thread Apoorv Mittal (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-17809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apoorv Mittal resolved KAFKA-17809.
---
Resolution: Fixed

> Fix flaky test: testExplicitAcknowledgementCommitAsync
> --
>
> Key: KAFKA-17809
> URL: https://issues.apache.org/jira/browse/KAFKA-17809
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Apoorv Mittal
>Assignee: Apoorv Mittal
>Priority: Minor
>
> ShareConsumerTest: testExplicitAcknowledgementCommitAsync is flaky
> [https://github.com/apache/kafka/actions/runs/11354171046/job/31581087161]
>  
> {code:java}
> org.opentest4j.AssertionFailedError: expected: <1> but was: <0>
>   at app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>   at app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>   at app//org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197)
>   at app//org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:150)
>   at app//org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:145)
>   at app//org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:531)
>   at app//kafka.test.api.ShareConsumerTest.testExplicitAcknowledgementCommitAsync(ShareConsumerTest.java:544)
>   at java.base@21.0.4/java.lang.reflect.Method.invoke(Method.java:580)
>   at java.base@21.0.4/java.util.ArrayList.forEach(ArrayList.java:1596)
>   at java.base@21.0.4/java.util.ArrayList.forEach(ArrayList.java:1596) 
> {code}





[jira] [Created] (KAFKA-17830) Cover unit test for TBRLMM init failure cases

2024-10-18 Thread Kamal Chandraprakash (Jira)
Kamal Chandraprakash created KAFKA-17830:


 Summary: Cover unit test for TBRLMM init failure cases
 Key: KAFKA-17830
 URL: https://issues.apache.org/jira/browse/KAFKA-17830
 Project: Kafka
  Issue Type: Task
Reporter: Kamal Chandraprakash


[TopicBasedRemoteLogMetadataManagerTest|https://sourcegraph.com/github.com/apache/kafka/-/blob/storage/src/test/java/org/apache/kafka/server/log/remote/metadata/storage/TopicBasedRemoteLogMetadataManagerTest.java]
 does not cover initialization failure scenarios; it would be good to cover those 
cases with unit tests.
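The pattern such tests would follow can be sketched with a toy model (the `ToyMetadataManager` class below is a hypothetical stand-in, not the real `TopicBasedRemoteLogMetadataManager` API): inject a failure into initialization and assert the manager ends up in the expected failed state.

```java
// Toy stand-in for a metadata manager whose initialization can fail
// (e.g. because an internal topic cannot be created); illustrative only.
class ToyMetadataManager {
    private final Runnable initAction;
    private volatile boolean initialized = false;

    ToyMetadataManager(Runnable initAction) {
        this.initAction = initAction;
    }

    void initialize() {
        initAction.run();   // may throw, simulating an init failure
        initialized = true;
    }

    boolean isInitialized() {
        return initialized;
    }
}

public class InitFailureTestSketch {
    // Trigger a failure during init and assert the manager
    // reports itself as uninitialized afterwards.
    public static boolean initFailureLeavesManagerUninitialized() {
        ToyMetadataManager manager = new ToyMetadataManager(() -> {
            throw new RuntimeException("simulated init failure");
        });
        try {
            manager.initialize();
        } catch (RuntimeException expected) {
            // swallow: the failure itself is what the test exercises
        }
        return !manager.isInitialized();
    }

    public static void main(String[] args) {
        System.out.println("init failure detected: " + initFailureLeavesManagerUninitialized());
    }
}
```

The actual unit tests would inject failures through the real TBRLMM collaborators (topic creation, consumer setup, etc.) rather than a `Runnable`.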





[jira] [Created] (KAFKA-17831) Transaction coordinators returning COORDINATOR_LOAD_IN_PROGRESS until leader changes or brokers are restarted after network instability

2024-10-18 Thread Kay Hartmann (Jira)
Kay Hartmann created KAFKA-17831:


 Summary: Transaction coordinators returning 
COORDINATOR_LOAD_IN_PROGRESS until leader changes or brokers are restarted 
after network instability
 Key: KAFKA-17831
 URL: https://issues.apache.org/jira/browse/KAFKA-17831
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 3.7.1, 3.6.1
Reporter: Kay Hartmann


After experiencing a (heavy) network outage/instability, our brokers arrived in 
a state where some producers were not able to perform transactions, as the 
brokers kept responding to those producers with 
`COORDINATOR_LOAD_IN_PROGRESS`. We were able to see corresponding DEBUG logs in 
the brokers:
{code:java}
2024-08-06 15:22:01,178 DEBUG [TransactionCoordinator id=11] Returning 
COORDINATOR_LOAD_IN_PROGRESS error code to client for my-client's AddPartitions 
request (kafka.coordinator.transaction.TransactionCoordinator) 
[data-plane-kafka-request-handler-5] {code}
This did not occur for all transactions, but for a subset of transactional ids 
with the same hash that would go through the same transaction 
coordinator/partition leader for the corresponding `__transaction_state` 
partition. We were able to resolve this the first time by shifting partition 
leaders for the transaction topic around and the second time by simply 
restarting brokers.

 

This led us to believe that it has to be some kind of dirty in-memory state 
that transaction coordinators hold for a `__transaction_state` partition. We found 
two cases 
([#1|https://github.com/apache/kafka/blob/3.6.1/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L319],
 
[#2|https://github.com/apache/kafka/blob/3.6.1/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L376])
 in which the TransactionStateManager returns `COORDINATOR_LOAD_IN_PROGRESS`. 
In both cases `loadingPartitions` has some state that signals that the 
TransactionStateManager is still occupied with initializing transactional data 
for that `__transaction_state` partition.

We believe that the network outage caused partition leaders to be shifted 
around continuously between their replicas, and somehow this led to outdated 
data in `loadingPartitions` that wasn't cleaned up. I had a look at the 
[method|https://github.com/apache/kafka/blob/3.6.1/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L518]
 where it is updated and cleaned, but wasn't able to identify a case in which 
the cleanup could fail.
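The suspected failure mode can be sketched with a highly simplified model (the names `loadingPartitions`, `markLoading`, and `finishLoading` below are illustrative, not the actual `TransactionStateManager` internals, which are Scala): while a partition sits in the loading set, every request for it is answered with `COORDINATOR_LOAD_IN_PROGRESS`, so a stale entry that is never removed fails requests indefinitely.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Highly simplified model of the suspected bug: a set tracking partitions
// whose transaction state is still being loaded after a leader election.
public class LoadingPartitionsSketch {
    private final Set<Integer> loadingPartitions = ConcurrentHashMap.newKeySet();

    void markLoading(int partition) {
        loadingPartitions.add(partition);      // load starts on leader election
    }

    void finishLoading(int partition) {
        loadingPartitions.remove(partition);   // cleanup once the load completes
    }

    // Mirrors the two code paths in the report: any request for a partition
    // still in the loading set gets COORDINATOR_LOAD_IN_PROGRESS.
    String handleRequest(int partition) {
        return loadingPartitions.contains(partition)
                ? "COORDINATOR_LOAD_IN_PROGRESS"
                : "OK";
    }

    public static String staleEntryOutcome() {
        LoadingPartitionsSketch mgr = new LoadingPartitionsSketch();
        mgr.markLoading(7);
        // Suppose rapid, repeated leadership changes cause finishLoading(7)
        // to be skipped: the entry stays forever, and so does the error.
        return mgr.handleRequest(7);
    }

    public static void main(String[] args) {
        System.out.println(staleEntryOutcome());
    }
}
```

Under this model the observed remedies make sense: moving the partition leader or restarting the broker rebuilds the in-memory set from scratch, clearing the stale entry.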


