[jira] [Created] (KAFKA-14047) Use KafkaRaftManager in KRaft TestKit

2022-07-06 Thread dengziming (Jira)
dengziming created KAFKA-14047:
--

 Summary: Use KafkaRaftManager in KRaft TestKit
 Key: KAFKA-14047
 URL: https://issues.apache.org/jira/browse/KAFKA-14047
 Project: Kafka
  Issue Type: Test
Reporter: dengziming


We are currently using the lower-level {{ControllerServer}} and {{BrokerServer}} in TestKit; 
we can improve it to use KafkaRaftManager instead.

see the discussion here: 
https://github.com/apache/kafka/pull/12157#discussion_r882179407





[jira] [Created] (KAFKA-14048) The Next Generation of the Consumer Rebalance Protocol

2022-07-06 Thread David Jacot (Jira)
David Jacot created KAFKA-14048:
---

 Summary: The Next Generation of the Consumer Rebalance Protocol
 Key: KAFKA-14048
 URL: https://issues.apache.org/jira/browse/KAFKA-14048
 Project: Kafka
  Issue Type: Improvement
Reporter: David Jacot
Assignee: David Jacot


This Jira tracks the development of KIP-848: 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol.





[DISCUSS] KIP-848: The Next Generation of the Consumer Rebalance Protocol

2022-07-06 Thread David Jacot
Hi all,

I would like to start a discussion thread on KIP-848: The Next
Generation of the Consumer Rebalance Protocol. With this KIP, we aim
to make the rebalance protocol (for consumers) more reliable, more
scalable, easier to implement for clients, and easier to debug for
operators.

The KIP is here: https://cwiki.apache.org/confluence/x/HhD1D.

Please take a look and let me know what you think.

Best,
David

PS: I will be away from July 18th to August 8th. That gives you a bit
of time to read and digest this long KIP.


[jira] [Created] (KAFKA-14049) Relax Non Null Requirement for KStreamGlobalKTable Left Join

2022-07-06 Thread Saumya Gupta (Jira)
Saumya Gupta created KAFKA-14049:


 Summary: Relax Non Null Requirement for KStreamGlobalKTable Left 
Join
 Key: KAFKA-14049
 URL: https://issues.apache.org/jira/browse/KAFKA-14049
 Project: Kafka
  Issue Type: Improvement
  Components: streams
Reporter: Saumya Gupta


A null value in the stream for a left join would indicate a tombstone message that needs 
to be propagated even when it does not actually join with a GlobalKTable record; hence 
these messages should not be ignored.
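
For illustration only, a rough sketch of the join in question (topic names and String values are placeholders, not taken from the reporter's application):
{code:java}
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> orders = builder.stream("orders");
GlobalKTable<String, String> customers = builder.globalTable("customers");

// Left join against the global table: the ValueJoiner receives (orderValue, customerValueOrNull).
// Under the current non-null requirement, a stream record whose value is null (a tombstone)
// is skipped before the joiner runs, so the tombstone is never propagated downstream.
orders.leftJoin(
        customers,
        (orderKey, orderValue) -> orderKey,                               // key selector into the global table
        (orderValue, customerValue) -> orderValue + "," + customerValue)  // customerValue may be null
      .to("enriched-orders");
{code}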





Re: [Vote] KIP-787 - MM2 Interface to manage Kafka resources

2022-07-06 Thread Omnia Ibrahim
Hi,
Can we have one last binding vote for this KIP, please?

Omnia

On Tue, Jun 28, 2022 at 3:36 PM Omnia Ibrahim 
wrote:

> Thanks, Tom, I have updated the KIP to reflect these minor points.
>
> On Tue, Jun 28, 2022 at 10:58 AM Tom Bentley  wrote:
>
>> Hi again Omnia,
>>
>> I left a couple of really minor points in the discussion thread. Assuming
>> you're happy with those I am +1 (binding).
>>
>> Thanks for your patience on this. Kind regards,
>>
>> Tom
>>
>> On Tue, 28 Jun 2022 at 10:14, Omnia Ibrahim 
>> wrote:
>>
>> > Hi,
>> > I did a small change to the KIP interface to address some discussions on
>> > the KIP. If all is good now can I get more votes on this, please?
>> >
>> > Thanks
>> > Omnia
>> >
>> > On Tue, Jun 21, 2022 at 10:34 AM Omnia Ibrahim > >
>> > wrote:
>> >
>> > > Hi,
>> > > Can I get more votes on this, please?
>> > >
>> > > Thanks
>> > >
>> > > On Sun, Jun 12, 2022 at 2:40 PM Federico Valeri > >
>> > > wrote:
>> > >
>> > >> Hi Omnia, this will be really useful, especially in cloud
>> environment.
>> > >>
>> > >> +1 non binding
>> > >>
>> > >> Thanks
>> > >> Fede
>> > >>
>> > >> On Tue, May 31, 2022 at 5:28 PM Mickael Maison <
>> > mickael.mai...@gmail.com>
>> > >> wrote:
>> > >> >
>> > >> > Hi Omnia,
>> > >> >
>> > >> > I think the approach you settled on is the best option, this will
>> > >> > allow integrating MirrorMaker in more environments.
>> > >> >
>> > >> > +1 binding
>> > >> >
>> > >> > Thanks for the KIP (and your persistence!)
>> > >> > Mickael
>> > >> >
>> > >> > On Mon, May 30, 2022 at 12:09 PM Omnia Ibrahim <
>> > o.g.h.ibra...@gmail.com>
>> > >> wrote:
>> > >> > >
>> > >> > > Hi,
>> > >> > > Can I get a vote on this, please?
>> > >> > > Thanks
>> > >> > >
>> > >> > > On Wed, May 25, 2022 at 11:15 PM Omnia Ibrahim <
>> > >> o.g.h.ibra...@gmail.com>
>> > >> > > wrote:
>> > >> > >
>> > >> > > > Hi,
>> > >> > > > I'd like to start a vote on KIP-787
>> > >> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-787
>> > >> > > > %3A+MM2+Interface+to+manage+Kafka+resources
>> > >> > > > <
>> > >>
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-787%3A+MM2+Interface+to+manage+Kafka+resources
>> > >> >
>> > >> > > >
>> > >> > > > Thanks
>> > >> > > > Omnia
>> > >> > > >
>> > >>
>> > >
>> >
>>
>


Transactions, delivery timeout and changing transactional producer behavior

2022-07-06 Thread Dániel Urbán
Hello everyone,

I've been investigating some transaction related issues in a very
problematic cluster. Besides finding some interesting issues, I had some
ideas about how transactional producer behavior could be improved.

My suggestion in short is: when the transactional producer encounters an
error which doesn't necessarily mean that the in-flight request was
processed (for example a client side timeout), the producer should not send
an EndTxnRequest on abort, but instead it should bump the producer epoch.

The long description about the issue I found, and how I came to the
suggestion:

First, the description of the issue. When I say that the cluster is "very
problematic", I mean all kinds of different issues, be it infra (disks and
network) or throughput (high volume producers without fine tuning).
In this cluster, Kafka transactions are widely used by many producers. And
in this cluster, partitions get "stuck" frequently (few times every week).

The exact meaning of a partition being "stuck" is this:

On the client side (a rough code sketch of this sequence follows the list):
1. A transactional producer sends X batches to a partition in a single
transaction
2. Out of the X batches, the last few get sent, but are timed out thanks to
the delivery timeout config
3. producer.flush() is unblocked due to all batches being "finished"
4. Based on the errors reported in the producer.send() callback,
producer.abortTransaction() is called
5. Then producer.close() is also invoked with a 5s timeout (this
application does not reuse the producer instances optimally)
6. The transactional.id of the producer is never reused (it was random
generated)
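
A rough sketch of the above client-side sequence (topic names, keys and values are
placeholders; this is not the actual application code):

```
import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, UUID.randomUUID().toString()); // never reused
props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120000);

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.initTransactions();

List<Exception> sendErrors = Collections.synchronizedList(new ArrayList<>());
try {
    producer.beginTransaction();
    for (int i = 0; i < 10; i++) {                                  // X batches to one partition
        producer.send(new ProducerRecord<>("topic", "key", "value-" + i),
                (metadata, exception) -> { if (exception != null) sendErrors.add(exception); });
    }
    producer.flush();                  // unblocks once every batch is "finished",
                                       // including batches failed by the delivery timeout
    if (sendErrors.isEmpty()) {
        producer.commitTransaction();
    } else {
        producer.abortTransaction();   // EndTxnRequest is sent although some batches
                                       // may still be in flight on the broker side
    }
} finally {
    producer.close(Duration.ofSeconds(5));
}
```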

On the partition leader side (what appears in the log segment of the
partition):
1. The batches sent by the producer are all appended to the log
2. But the ABORT marker of the transaction was appended before the last 1
or 2 batches of the transaction

On the transaction coordinator side (what appears in the transaction state
partition):
The transactional.id is present with the Empty state.

These happenings result in the following:
1. The partition leader handles the first batch after the ABORT marker as
the first message of a new transaction of the same producer id + epoch.
(LSO is blocked at this point)
2. The transaction coordinator is not aware of any in-progress transaction
of the producer, thus never aborting the transaction, not even after the
transaction.timeout.ms passes.

This is happening with Kafka 2.5 running in the cluster, producer versions
range between 2.0 and 2.6.
I scanned through a lot of tickets, and I believe that this issue is not
specific to these versions, and could happen with newest versions as well.
If I'm mistaken, some pointers would be appreciated.

Assuming that the issue could occur with any version, I believe this issue
boils down to one oversight on the client side:
When a request fails without a definitive response (e.g. a delivery
timeout), the client cannot assume that the request is "finished" and
simply abort the transaction. If that request is still in flight while the
EndTxnRequest, and then the WriteTxnMarkerRequest, are sent and processed
first, the contract is violated by the client.
This could be avoided by providing more information to the partition
leader. Right now, a new transactional batch signals the start of a new
transaction, and there is no way for the partition leader to decide whether
the batch is an out-of-order message.
In a naive and wasteful protocol, we could have a unique transaction id
added to each batch and marker, meaning that the leader would be capable of
refusing batches which arrive after the control marker of the transaction.
But instead of changing the log format and the protocol, we can achieve the
same by bumping the producer epoch.

Bumping the epoch has a similar effect to "changing the transaction id" -
the in-progress transaction will be aborted with a bumped producer epoch,
telling the partition leader about the producer epoch change. From this
point on, any batches sent with the old epoch will be refused by the leader
due to the fencing mechanism. It doesn't really matter how many batches
will get appended to the log, and how many will be refused - this is an
aborted transaction - but the out-of-order message cannot occur, and cannot
block the LSO infinitely.

My suggestion is that the TransactionManager inside the producer should
keep track of what types of errors were encountered by the batches of the
transaction, and categorize them along the lines of "definitely completed"
and "might not be completed" (see the sketch below). When the transaction goes into an abortable
state and there is at least one batch with "might not be completed", the
EndTxnRequest should be skipped, and an epoch bump should be sent instead.
As for what type of error counts as "might not be completed", I can only
think of client-side timeouts.
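
To make that concrete, an illustrative sketch of the decision (the names and helpers here
are invented for the example and do not correspond to the real TransactionManager internals):

```
import java.util.Collection;
import org.apache.kafka.common.errors.TimeoutException;

// Everything below is an invented illustration, not the real TransactionManager API.
enum BatchOutcome { DEFINITELY_COMPLETED, MIGHT_NOT_BE_COMPLETED }

interface TxnActions {
    void sendEndTxnAbort();      // today's behavior: EndTxnRequest(abort)
    void bumpEpochAndAbort();    // proposed: bump the producer epoch, fencing old-epoch batches
}

final class AbortDecision {
    // A client-side timeout gives no definitive broker response, so the batch
    // could still be appended to the log after the abort is issued.
    static BatchOutcome classify(Exception batchError) {
        return batchError instanceof TimeoutException
                ? BatchOutcome.MIGHT_NOT_BE_COMPLETED
                : BatchOutcome.DEFINITELY_COMPLETED;
    }

    static void abort(Collection<BatchOutcome> outcomes, TxnActions actions) {
        if (outcomes.contains(BatchOutcome.MIGHT_NOT_BE_COMPLETED)) {
            actions.bumpEpochAndAbort();
        } else {
            actions.sendEndTxnAbort();
        }
    }
}
```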

I believe this is a relatively small change (only affects the client lib),
but it helps in avoiding some corrupt states in Kafka transactions.

Looking forward to 

Re: Transactions, delivery timeout and changing transactional producer behavior

2022-07-06 Thread Artem Livshits
Hi Daniel,

What you say makes sense.  Could you file a bug and put this info there so
that it's easier to track?

-Artem

On Wed, Jul 6, 2022 at 8:34 AM Dániel Urbán  wrote:

> Hello everyone,
>
> I've been investigating some transaction related issues in a very
> problematic cluster. Besides finding some interesting issues, I had some
> ideas about how transactional producer behavior could be improved.
>
> My suggestion in short is: when the transactional producer encounters an
> error which doesn't necessarily mean that the in-flight request was
> processed (for example a client side timeout), the producer should not send
> an EndTxnRequest on abort, but instead it should bump the producer epoch.
>
> The long description about the issue I found, and how I came to the
> suggestion:
>
> First, the description of the issue. When I say that the cluster is "very
> problematic", I mean all kinds of different issues, be it infra (disks and
> network) or throughput (high volume producers without fine tuning).
> In this cluster, Kafka transactions are widely used by many producers. And
> in this cluster, partitions get "stuck" frequently (few times every week).
>
> The exact meaning of a partition being "stuck" is this:
>
> On the client side:
> 1. A transactional producer sends X batches to a partition in a single
> transaction
> 2. Out of the X batches, the last few get sent, but are timed out thanks to
> the delivery timeout config
> 3. producer.flush() is unblocked due to all batches being "finished"
> 4. Based on the errors reported in the producer.send() callback,
> producer.abortTransaction() is called
> 5. Then producer.close() is also invoked with a 5s timeout (this
> application does not reuse the producer instances optimally)
> 6. The transactional.id of the producer is never reused (it was random
> generated)
>
> On the partition leader side (what appears in the log segment of the
> partition):
> 1. The batches sent by the producer are all appended to the log
> 2. But the ABORT marker of the transaction was appended before the last 1
> or 2 batches of the transaction
>
> On the transaction coordinator side (what appears in the transaction state
> partition):
> The transactional.id is present with the Empty state.
>
> These happenings result in the following:
> 1. The partition leader handles the first batch after the ABORT marker as
> the first message of a new transaction of the same producer id + epoch.
> (LSO is blocked at this point)
> 2. The transaction coordinator is not aware of any in-progress transaction
> of the producer, thus never aborting the transaction, not even after the
> transaction.timeout.ms passes.
>
> This is happening with Kafka 2.5 running in the cluster, producer versions
> range between 2.0 and 2.6.
> I scanned through a lot of tickets, and I believe that this issue is not
> specific to these versions, and could happen with newest versions as well.
> If I'm mistaken, some pointers would be appreciated.
>
> Assuming that the issue could occur with any version, I believe this issue
> boils down to one oversight on the client side:
> When a request fails without a definitive response (e.g. a delivery
> timeout), the client cannot assume that the request is "finished", and
> simply abort the transaction. If the request is still in flight, and the
> EndTxnRequest, then the WriteTxnMarkerRequest gets sent and processed
> earlier, the contract is violated by the client.
> This could be avoided by providing more information to the partition
> leader. Right now, a new transactional batch signals the start of a new
> transaction, and there is no way for the partition leader to decide whether
> the batch is an out-of-order message.
> In a naive and wasteful protocol, we could have a unique transaction id
> added to each batch and marker, meaning that the leader would be capable of
> refusing batches which arrive after the control marker of the transaction.
> But instead of changing the log format and the protocol, we can achieve the
> same by bumping the producer epoch.
>
> Bumping the epoch has a similar effect to "changing the transaction id" -
> the in-progress transaction will be aborted with a bumped producer epoch,
> telling the partition leader about the producer epoch change. From this
> point on, any batches sent with the old epoch will be refused by the leader
> due to the fencing mechanism. It doesn't really matter how many batches
> will get appended to the log, and how many will be refused - this is an
> aborted transaction - but the out-of-order message cannot occur, and cannot
> block the LSO infinitely.
>
> My suggestion is, that the TransactionManager inside the producer should
> keep track of what type of errors were encountered by the batches of the
> transaction, and categorize them along the lines of "definitely completed"
> and "might not be completed". When the transaction goes into an abortable
> state, and there is at least one batch with "might not be complete

Re: [VOTE] KIP-840: Config file option for MessageReader/MessageFormatter in ConsoleProducer/ConsoleConsumer

2022-07-06 Thread Colin McCabe
+1 (binding).

thanks, Alexandre.

On Mon, Jun 27, 2022, at 05:15, Alexandre Garnier wrote:
> Hello!
>
> A little ping on this vote.
>
> Thanks.
>
> Le jeu. 16 juin 2022 à 16:36, Alexandre Garnier  a écrit :
>
>> Hi everyone.
>>
>> Anyone wants to give a last binding vote for this KIP?
>>
>> Thanks.
>>
>> Le mar. 7 juin 2022 à 14:53, Alexandre Garnier  a
>> écrit :
>>
>>> Hi!
>>>
>>> A little reminder to vote for this KIP.
>>>
>>> Thanks.
>>>
>>>
>>> Le mer. 1 juin 2022 à 10:58, Alexandre Garnier  a
>>> écrit :
>>> >
>>> > Hi everyone!
>>> >
>>> > I propose to start voting for KIP-840:
>>> > https://cwiki.apache.org/confluence/x/bBqhD
>>> >
>>> > Thanks,
>>> > --
>>> > Alex
>>>
>>


Re: [Vote] KIP-787 - MM2 Interface to manage Kafka resources

2022-07-06 Thread David Jacot
Thanks for the KIP, Omnia! +1 (binding)

On Wed, Jul 6, 2022 at 5:02 PM Omnia Ibrahim  wrote:
>
> Hi,
> Can we have one last binding vote for this KIP, please?
>
> Omnia
>
> On Tue, Jun 28, 2022 at 3:36 PM Omnia Ibrahim 
> wrote:
>
> > Thanks, Tom, I have updated the KIP to reflect these minor points.
> >
> > On Tue, Jun 28, 2022 at 10:58 AM Tom Bentley  wrote:
> >
> >> Hi again Omnia,
> >>
> >> I left a couple of really minor points in the discussion thread. Assuming
> >> you're happy with those I am +1 (binding).
> >>
> >> Thanks for your patience on this. Kind regards,
> >>
> >> Tom
> >>
> >> On Tue, 28 Jun 2022 at 10:14, Omnia Ibrahim 
> >> wrote:
> >>
> >> > Hi,
> >> > I did a small change to the KIP interface to address some discussions on
> >> > the KIP. If all is good now can I get more votes on this, please?
> >> >
> >> > Thanks
> >> > Omnia
> >> >
> >> > On Tue, Jun 21, 2022 at 10:34 AM Omnia Ibrahim  >> >
> >> > wrote:
> >> >
> >> > > Hi,
> >> > > Can I get more votes on this, please?
> >> > >
> >> > > Thanks
> >> > >
> >> > > On Sun, Jun 12, 2022 at 2:40 PM Federico Valeri  >> >
> >> > > wrote:
> >> > >
> >> > >> Hi Omnia, this will be really useful, especially in cloud
> >> environment.
> >> > >>
> >> > >> +1 non binding
> >> > >>
> >> > >> Thanks
> >> > >> Fede
> >> > >>
> >> > >> On Tue, May 31, 2022 at 5:28 PM Mickael Maison <
> >> > mickael.mai...@gmail.com>
> >> > >> wrote:
> >> > >> >
> >> > >> > Hi Omnia,
> >> > >> >
> >> > >> > I think the approach you settled on is the best option, this will
> >> > >> > allow integrating MirrorMaker in more environments.
> >> > >> >
> >> > >> > +1 binding
> >> > >> >
> >> > >> > Thanks for the KIP (and your persistence!)
> >> > >> > Mickael
> >> > >> >
> >> > >> > On Mon, May 30, 2022 at 12:09 PM Omnia Ibrahim <
> >> > o.g.h.ibra...@gmail.com>
> >> > >> wrote:
> >> > >> > >
> >> > >> > > Hi,
> >> > >> > > Can I get a vote on this, please?
> >> > >> > > Thanks
> >> > >> > >
> >> > >> > > On Wed, May 25, 2022 at 11:15 PM Omnia Ibrahim <
> >> > >> o.g.h.ibra...@gmail.com>
> >> > >> > > wrote:
> >> > >> > >
> >> > >> > > > Hi,
> >> > >> > > > I'd like to start a vote on KIP-787
> >> > >> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-787
> >> > >> > > > %3A+MM2+Interface+to+manage+Kafka+resources
> >> > >> > > > <
> >> > >>
> >> >
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-787%3A+MM2+Interface+to+manage+Kafka+resources
> >> > >> >
> >> > >> > > >
> >> > >> > > > Thanks
> >> > >> > > > Omnia
> >> > >> > > >
> >> > >>
> >> > >
> >> >
> >>
> >


Re: [VOTE] KIP-840: Config file option for MessageReader/MessageFormatter in ConsoleProducer/ConsoleConsumer

2022-07-06 Thread David Jacot
+1 (binding). Thanks for the KIP! It is a useful addition.

Best,
David

On Wed, Jul 6, 2022 at 9:10 PM Colin McCabe  wrote:
>
> +1 (binding).
>
> thanks, Alexandre.
>
> On Mon, Jun 27, 2022, at 05:15, Alexandre Garnier wrote:
> > Hello!
> >
> > A little ping on this vote.
> >
> > Thanks.
> >
> > Le jeu. 16 juin 2022 à 16:36, Alexandre Garnier  a écrit :
> >
> >> Hi everyone.
> >>
> >> Anyone wants to give a last binding vote for this KIP?
> >>
> >> Thanks.
> >>
> >> Le mar. 7 juin 2022 à 14:53, Alexandre Garnier  a
> >> écrit :
> >>
> >>> Hi!
> >>>
> >>> A little reminder to vote for this KIP.
> >>>
> >>> Thanks.
> >>>
> >>>
> >>> Le mer. 1 juin 2022 à 10:58, Alexandre Garnier  a
> >>> écrit :
> >>> >
> >>> > Hi everyone!
> >>> >
> >>> > I propose to start voting for KIP-840:
> >>> > https://cwiki.apache.org/confluence/x/bBqhD
> >>> >
> >>> > Thanks,
> >>> > --
> >>> > Alex
> >>>
> >>


[jira] [Created] (KAFKA-14050) Older clients cannot deserialize ApiVersions response with finalized feature epoch

2022-07-06 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-14050:
---

 Summary: Older clients cannot deserialize ApiVersions response 
with finalized feature epoch
 Key: KAFKA-14050
 URL: https://issues.apache.org/jira/browse/KAFKA-14050
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson
 Fix For: 3.3.0


When testing kraft locally, we encountered this exception from an older client:
{code:java}
[ERROR] 2022-07-05 16:45:01,165 [kafka-admin-client-thread | adminclient-1394] 
org.apache.kafka.common.utils.KafkaThread lambda$configureThread$0 - Uncaught 
exception in thread 'kafka-admin-client-thread | admi
nclient-1394':
org.apache.kafka.common.protocol.types.SchemaException: Error reading field 
'api_keys': Error reading array of size 1207959552, only 579 bytes available
at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:118)
at 
org.apache.kafka.common.protocol.ApiKeys.parseResponse(ApiKeys.java:378)
at 
org.apache.kafka.common.protocol.ApiKeys$1.parseResponse(ApiKeys.java:187)
at 
org.apache.kafka.clients.NetworkClient$DefaultClientInterceptor.parseResponse(NetworkClient.java:1333)
at 
org.apache.kafka.clients.NetworkClient.parseStructMaybeUpdateThrottleTimeMetrics(NetworkClient.java:752)
at 
org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:888)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:577)
at 
org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.processRequests(KafkaAdminClient.java:1329)
at 
org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1260)
at java.base/java.lang.Thread.run(Thread.java:832) {code}
The cause appears to be from a change to the type of the 
`FinalizedFeaturesEpoch` field in the `ApiVersions` response from int32 to 
int64: 
[https://github.com/confluentinc/ce-kafka/commit/fb4f297207ef62f71e4a6d2d0dac75752933043d#diff-32006e8becae918416debdb9ac76bf8a1ad12b83aaaf5f8819b6ecc00c1fb56bL58.]

Fortunately, `FinalizedFeaturesEpoch` is a tagged field, so we can fix this by 
creating a new field. We will have to leave the existing tag in the protocol 
spec and consider it dead.

Credit for this find goes to [~dajac] .





[jira] [Resolved] (KAFKA-14032) Dequeue time for forwarded requests is ignored to set

2022-07-06 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-14032.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

> Dequeue time for forwarded requests is ignored to set
> -
>
> Key: KAFKA-14032
> URL: https://issues.apache.org/jira/browse/KAFKA-14032
> Project: Kafka
>  Issue Type: Bug
>Reporter: Feiyan Yu
>Priority: Minor
> Fix For: 3.3.0
>
>
> It seems like `requestDequeueTimeNanos` is never set for forwarded requests.
> As a property of a `Request` object, `requestDequeueTimeNanos` is set only 
> when handlers poll and handle the request from `requestQueue`. However, 
> handlers poll the envelope request only once but call the handle method 
> twice, which means `requestDequeueTimeNanos` is never set for the parsed 
> forwarded requests.
> The parsed envelope requests keep `requestDequeueTimeNanos` = -1, which 
> affects the correctness of the `LocalTimeMs` statistics and metrics.





[jira] [Created] (KAFKA-14051) KRaft remote controllers do not create metrics reporters

2022-07-06 Thread Ron Dagostino (Jira)
Ron Dagostino created KAFKA-14051:
-

 Summary: KRaft remote controllers do not create metrics reporters
 Key: KAFKA-14051
 URL: https://issues.apache.org/jira/browse/KAFKA-14051
 Project: Kafka
  Issue Type: Bug
  Components: kraft
Affects Versions: 3.3
Reporter: Ron Dagostino


KRaft remote controllers (KRaft nodes with the configuration value 
process.roles=controller) do not create the configured metrics reporters 
defined by the configuration key metric.reporters.  The reason is that KRaft 
remote controllers are not wired up for dynamic config changes, and the 
creation of the configured metrics reporters actually happens while wiring 
the broker up for dynamic reconfiguration, in the invocation of 
DynamicBrokerConfig.addReconfigurables(KafkaBroker).





[jira] [Created] (KAFKA-14052) Download verification directions are incorrect for linux

2022-07-06 Thread M Sesterhenn (Jira)
M Sesterhenn created KAFKA-14052:


 Summary: Download verification directions are incorrect for linux
 Key: KAFKA-14052
 URL: https://issues.apache.org/jira/browse/KAFKA-14052
 Project: Kafka
  Issue Type: Bug
  Components: documentation
 Environment: website
Reporter: M Sesterhenn


[https://www.apache.org/info/verification.html]

The above is linked to from the Kafka download page 
([https://kafka.apache.org/downloads]), and it contains incorrect instructions 
for verifying the release.

The .sha512 files for the downloads are all in this format:

 
{code:java}
kafka_2.13-3.2.0.tgz: 736A1298 23B058DC 10788D08 93BDE47B 6F39B9E4 972F9EAC 
2D5C9E85 E51E4773 44C6F1E1 EBD126CE 34D5FD43 0EB07E55 FDD60D60 CB541F1D 
48655C0E BC0A4778 
{code}
These files cannot easily be used to verify the expected hash using the 
procedure described on the verification website.  The website says to use:
{code:java}
sha512sum file {code}
...which doesn't do any hash comparison; it only tells you what the file's hash 
is, and it is up to the user to manually compare its output with the 
differently formatted output in the .sha512 file, which is error-prone and a 
chore.

Expected result:

I would expect to be able to do 
{code:java}
sha512sum -c file{code}
...like any normal download.

 

If the format of the .sha512 files cannot be changed to be compatible with the 
Linux shasum programs, then please update the website to describe the proper way 
to compare hashes.  The best way seems to be a script like this:

 
{code:java}
SHA=$(mktemp); gpg --print-md SHA512 $FILE > $SHA && diff $SHA $FILE.sha512 && 
echo "SHA checks out OK."
{code}
(where FILE is the downloaded tarball.)

I looked into providing a PR for the verification page, but that is an 
Apache-wide web page and probably is not publicly available.

 

 





Re: [VOTE] KIP-851: : Add requireStable flag into ListConsumerGroupOffsetsOptions

2022-07-06 Thread Guozhang Wang
Hello folks,

I tried to implement the script extension along with the unit test coverage
on DescribeConsumerGroupTest, but it turned out to be more complicated than I
anticipated, because we need to make sure the `require-stable`
flag is only effective when describing consumers. This also makes me feel
that maybe my original thought of just adding that option for "--describe"
may not be comprehensive, and we may need to think through under which
actions the additional option should be allowed. So I would remove this part
of the KIP and continue the voting thread.

I've updated the KIP addressing the renaming suggestion. And I will close
this thread as accepted with three binding votes (John, David, Bruno) if I
don't hear from you for more suggestions.
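
For reference, here is a minimal sketch of how the new option would be used from the
Admin client (method names as currently proposed in the KIP and subject to change;
error handling omitted):

```
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListConsumerGroupOffsetsOptions;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

try (Admin admin = Admin.create(props)) {
    ListConsumerGroupOffsetsOptions options = new ListConsumerGroupOffsetsOptions()
            .requireStable(true);   // proposed flag: wait for stably committed offsets

    Map<TopicPartition, OffsetAndMetadata> offsets =
            admin.listConsumerGroupOffsets("group1", options)
                 .partitionsToOffsetAndMetadata()
                 .get();
    offsets.forEach((tp, om) -> System.out.println(tp + " -> " + om.offset()));
}
```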

Thanks,
Guozhang



On Mon, Jul 4, 2022 at 1:57 AM Bruno Cadonna  wrote:

> Hi,
>
> I would prefer to not include the script-extension into the KIP if you
> you cannot commit to its implementation. I think partially implemented
> KIPs make release management harder. If we can avoid implementing KIPs
> partially, we should do it.
>
> I am +1 either way. I just wanted to bring this up.
>
> Best,
> Bruno
>
> On 04.07.22 04:37, Luke Chen wrote:
> > Hi Guozhang,
> >
> >> We can add it into this proposal though I could not commit to
> implementing
> > it myself with all the `DescribeConsumerGroupTest` coverage after it's
> > accepted, instead I could add a JIRA ticket under this KIP for others
> who's
> > interested to chime in. What do you think?
> >
> > Sounds good to me.
> >
> > Thank you.
> > Luke
> >
> > On Mon, Jul 4, 2022 at 3:44 AM Guozhang Wang  wrote:
> >
> >> Thanks for folks for your input !
> >>
> >> 1) I'm happy to change the setter names to be consistent with the
> >> topicPartition ones. I used a different name for getter from setter as I
> >> remember seeing some other options differentiating function names for
> >> getter and setters, while some other options seem to be more on just
> >> keeping the names the same. After getting your feedback I think it's
> better
> >> to do the same for both getter / setters.
> >>
> >> 2) For the kafka-consumer-group.sh tool, I looked at
> >> ConsumerGroupCommand#describeGroups, and I think it's appropriate to add
> >> it. I'm planning to add it in the shell tool as:
> >>
> >> ```
> >> "Require brokers to hold on returning unstable offsets (due to pending
> >> transactions) but retry until timeout for stably committed offsets"
> >> "Example: --bootstrap-server localhost:9092 --describe --group group1
> >> --offsets --require-stable"
> >> ```
> >>
> >> We can add it into this proposal though I could not commit to
> implementing
> >> it myself with all the `DescribeConsumerGroupTest` coverage after it's
> >> accepted, instead I could add a JIRA ticket under this KIP for others
> who's
> >> interested to chime in. What do you think?
> >>
> >>
> >> Guozhang
> >>
> >>
> >> On Fri, Jul 1, 2022 at 1:18 AM David Jacot  >
> >> wrote:
> >>
> >>> Hi Guozhang,
> >>>
> >>> Thanks for the KIP!
> >>>
> >>> I agree with Luke. `requireStable` seems more consistent.
> >>>
> >>> Regarding the kafka-consumer-group command line tool, I wonder if
> >>> there is real value in doing it. We don't necessarily have to add all
> >>> the options to it but we could if it is proven to be useful. Anyway, I
> >>> would leave it for a future KIP.
> >>>
> >>> +1 (binding)
> >>>
> >>> Best,
> >>> David
> >>>
> >>> On Fri, Jul 1, 2022 at 9:47 AM Bruno Cadonna 
> wrote:
> 
>  Hi Guozhang,
> 
>  thank you for the KIP!
> 
>  I do not have strong feelings about the naming of the getter, but I
> >> tend
>  to agree with Luke.
> 
>  Regarding, the adaptation of the kafka-consumer-group.sh script, I am
>  fine if we leave that for a future KIP.
> 
>  +1 (binding)
> 
>  Best,
>  Bruno
> 
>  On 01.07.22 06:05, Luke Chen wrote:
> > Hi Guozhang,
> >
> > Thanks for the KIP.
> > Some comments:
> > 1. I have the same question as Ziming, should we also add an option
> >> in
> > kafka-consumer-groups.sh in this KIP?
> > Or you'd like to keep the current scope, and other people can create
> >> a
> > follow-up KIP to address the kafka-consumer-groups.sh script?
> > 2. The setter method name: `shouldRequireStable` might need to rename
> >>> to
> > `requireStable` to be consistent with above `topicPartitions`
> >>> getter/setter
> >
> > Thank you.
> > Luke
> >
> > On Fri, Jul 1, 2022 at 11:17 AM John Roesler 
> >>> wrote:
> >
> >> Thanks for the KIP, Guozhang!
> >>
> >> I’m +1 (binding)
> >>
> >> -John
> >>
> >> On Thu, Jun 30, 2022, at 21:17, deng ziming wrote:
> >>> Thanks for this KIP,
> >>> we have a kafka-consumer-groups.sh shell which is based on the API
> >>> you
> >>> proposed to change, is it worth update it as well?
> >>>
> >>> --
> >>> Best,
> >>> Ziming
> >>>
> 

Re: [VOTE] KIP-851: : Add requireStable flag into ListConsumerGroupOffsetsOptions

2022-07-06 Thread Luke Chen
Hi Guozhang,

Removing the script extension from the KIP is good with me.
+1 from me, too.

Thank you.
Luke

On Thu, Jul 7, 2022 at 7:39 AM Guozhang Wang  wrote:

> Hello folks,
>
> I tried to implement the script-extension along with the unit test coverage
> on DescribeConsumerGroupTest, but it turns out more complicated than I
> anticipated due to the fact that we need to make sure the `require-stable`
> flag is only effective for describing consumers. This also makes me feeling
> that maybe my original thought of just adding that option for "--describe"
> may not be comprehensive and we may need to think through under which
> action the additional option should be allowed. So I would remove this part
> of the KIP and continue the voting thread.
>
> I've updated the KIP addressing the renaming suggestion. And I will close
> this thread as accepted with three binding votes (John, David, Bruno) if I
> don't hear from you for more suggestions.
>
> Thanks,
> Guozhang
>
>
>
> On Mon, Jul 4, 2022 at 1:57 AM Bruno Cadonna  wrote:
>
> > Hi,
> >
> > I would prefer to not include the script-extension into the KIP if you
> > you cannot commit to its implementation. I think partially implemented
> > KIPs make release management harder. If we can avoid implementing KIPs
> > partially, we should do it.
> >
> > I am +1 either way. I just wanted to bring this up.
> >
> > Best,
> > Bruno
> >
> > On 04.07.22 04:37, Luke Chen wrote:
> > > Hi Guozhang,
> > >
> > >> We can add it into this proposal though I could not commit to
> > implementing
> > > it myself with all the `DescribeConsumerGroupTest` coverage after it's
> > > accepted, instead I could add a JIRA ticket under this KIP for others
> > who's
> > > interested to chime in. What do you think?
> > >
> > > Sounds good to me.
> > >
> > > Thank you.
> > > Luke
> > >
> > > On Mon, Jul 4, 2022 at 3:44 AM Guozhang Wang 
> wrote:
> > >
> > >> Thanks for folks for your input !
> > >>
> > >> 1) I'm happy to change the setter names to be consistent with the
> > >> topicPartition ones. I used a different name for getter from setter
> as I
> > >> remember seeing some other options differentiating function names for
> > >> getter and setters, while some other options seem to be more on just
> > >> keeping the names the same. After getting your feedback I think it's
> > better
> > >> to do the same for both getter / setters.
> > >>
> > >> 2) For the kafka-consumer-group.sh tool, I looked at
> > >> ConsumerGroupCommand#describeGroups, and I think it's appropriate to
> add
> > >> it. I'm planning to add it in the shell tool as:
> > >>
> > >> ```
> > >> "Require brokers to hold on returning unstable offsets (due to pending
> > >> transactions) but retry until timeout for stably committed offsets"
> > >> "Example: --bootstrap-server localhost:9092 --describe --group group1
> > >> --offsets --require-stable"
> > >> ```
> > >>
> > >> We can add it into this proposal though I could not commit to
> > implementing
> > >> it myself with all the `DescribeConsumerGroupTest` coverage after it's
> > >> accepted, instead I could add a JIRA ticket under this KIP for others
> > who's
> > >> interested to chime in. What do you think?
> > >>
> > >>
> > >> Guozhang
> > >>
> > >>
> > >> On Fri, Jul 1, 2022 at 1:18 AM David Jacot
>  > >
> > >> wrote:
> > >>
> > >>> Hi Guozhang,
> > >>>
> > >>> Thanks for the KIP!
> > >>>
> > >>> I agree with Luke. `requireStable` seems more consistent.
> > >>>
> > >>> Regarding the kafka-consumer-group command line tool, I wonder if
> > >>> there is real value in doing it. We don't necessarily have to add all
> > >>> the options to it but we could if it is proven to be useful. Anyway,
> I
> > >>> would leave it for a future KIP.
> > >>>
> > >>> +1 (binding)
> > >>>
> > >>> Best,
> > >>> David
> > >>>
> > >>> On Fri, Jul 1, 2022 at 9:47 AM Bruno Cadonna 
> > wrote:
> > 
> >  Hi Guozhang,
> > 
> >  thank you for the KIP!
> > 
> >  I do not have strong feelings about the naming of the getter, but I
> > >> tend
> >  to agree with Luke.
> > 
> >  Regarding, the adaptation of the kafka-consumer-group.sh script, I
> am
> >  fine if we leave that for a future KIP.
> > 
> >  +1 (binding)
> > 
> >  Best,
> >  Bruno
> > 
> >  On 01.07.22 06:05, Luke Chen wrote:
> > > Hi Guozhang,
> > >
> > > Thanks for the KIP.
> > > Some comments:
> > > 1. I have the same question as Ziming, should we also add an option
> > >> in
> > > kafka-consumer-groups.sh in this KIP?
> > > Or you'd like to keep the current scope, and other people can
> create
> > >> a
> > > follow-up KIP to address the kafka-consumer-groups.sh script?
> > > 2. The setter method name: `shouldRequireStable` might need to
> rename
> > >>> to
> > > `requireStable` to be consistent with above `topicPartitions`
> > >>> getter/setter
> > >
> > > Thank you.
> > > Luke
> > >
> > > On Fri, Jul 1, 2022 at 11:1

Re: Transactions, delivery timeout and changing transactional producer behavior

2022-07-06 Thread Luke Chen
Hi Daniel,

Thanks for reporting the issue, and the investigation.
I'm curious, so, what's your workaround for this issue?

I agree with Artem, it makes sense. Please file a bug in JIRA.
And looking forward to your PR! :)

Thank you.
Luke

On Thu, Jul 7, 2022 at 3:07 AM Artem Livshits
 wrote:

> Hi Daniel,
>
> What you say makes sense.  Could you file a bug and put this info there so
> that it's easier to track?
>
> -Artem
>
> On Wed, Jul 6, 2022 at 8:34 AM Dániel Urbán  wrote:
>
> > Hello everyone,
> >
> > I've been investigating some transaction related issues in a very
> > problematic cluster. Besides finding some interesting issues, I had some
> > ideas about how transactional producer behavior could be improved.
> >
> > My suggestion in short is: when the transactional producer encounters an
> > error which doesn't necessarily mean that the in-flight request was
> > processed (for example a client side timeout), the producer should not
> send
> > an EndTxnRequest on abort, but instead it should bump the producer epoch.
> >
> > The long description about the issue I found, and how I came to the
> > suggestion:
> >
> > First, the description of the issue. When I say that the cluster is "very
> > problematic", I mean all kinds of different issues, be it infra (disks
> and
> > network) or throughput (high volume producers without fine tuning).
> > In this cluster, Kafka transactions are widely used by many producers.
> And
> > in this cluster, partitions get "stuck" frequently (few times every
> week).
> >
> > The exact meaning of a partition being "stuck" is this:
> >
> > On the client side:
> > 1. A transactional producer sends X batches to a partition in a single
> > transaction
> > 2. Out of the X batches, the last few get sent, but are timed out thanks
> to
> > the delivery timeout config
> > 3. producer.flush() is unblocked due to all batches being "finished"
> > 4. Based on the errors reported in the producer.send() callback,
> > producer.abortTransaction() is called
> > 5. Then producer.close() is also invoked with a 5s timeout (this
> > application does not reuse the producer instances optimally)
> > 6. The transactional.id of the producer is never reused (it was random
> > generated)
> >
> > On the partition leader side (what appears in the log segment of the
> > partition):
> > 1. The batches sent by the producer are all appended to the log
> > 2. But the ABORT marker of the transaction was appended before the last 1
> > or 2 batches of the transaction
> >
> > On the transaction coordinator side (what appears in the transaction
> state
> > partition):
> > The transactional.id is present with the Empty state.
> >
> > These happenings result in the following:
> > 1. The partition leader handles the first batch after the ABORT marker as
> > the first message of a new transaction of the same producer id + epoch.
> > (LSO is blocked at this point)
> > 2. The transaction coordinator is not aware of any in-progress
> transaction
> > of the producer, thus never aborting the transaction, not even after the
> > transaction.timeout.ms passes.
> >
> > This is happening with Kafka 2.5 running in the cluster, producer
> versions
> > range between 2.0 and 2.6.
> > I scanned through a lot of tickets, and I believe that this issue is not
> > specific to these versions, and could happen with newest versions as
> well.
> > If I'm mistaken, some pointers would be appreciated.
> >
> > Assuming that the issue could occur with any version, I believe this
> issue
> > boils down to one oversight on the client side:
> > When a request fails without a definitive response (e.g. a delivery
> > timeout), the client cannot assume that the request is "finished", and
> > simply abort the transaction. If the request is still in flight, and the
> > EndTxnRequest, then the WriteTxnMarkerRequest gets sent and processed
> > earlier, the contract is violated by the client.
> > This could be avoided by providing more information to the partition
> > leader. Right now, a new transactional batch signals the start of a new
> > transaction, and there is no way for the partition leader to decide
> whether
> > the batch is an out-of-order message.
> > In a naive and wasteful protocol, we could have a unique transaction id
> > added to each batch and marker, meaning that the leader would be capable
> of
> > refusing batches which arrive after the control marker of the
> transaction.
> > But instead of changing the log format and the protocol, we can achieve
> the
> > same by bumping the producer epoch.
> >
> > Bumping the epoch has a similar effect to "changing the transaction id" -
> > the in-progress transaction will be aborted with a bumped producer epoch,
> > telling the partition leader about the producer epoch change. From this
> > point on, any batches sent with the old epoch will be refused by the
> leader
> > due to the fencing mechanism. It doesn't really matter how many batches
> > will get appended to the log, and how many will 

Build failed in Jenkins: Kafka » Kafka Branch Builder » trunk #1053

2022-07-06 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 401718 lines...]
[2022-07-07T04:15:19.177Z] 
[2022-07-07T04:15:19.177Z] 
org.apache.kafka.streams.processor.internals.StreamsAssignmentScaleTest > 
testFallbackPriorTaskAssignorManyStandbys PASSED
[2022-07-07T04:15:19.177Z] 
[2022-07-07T04:15:19.177Z] 
org.apache.kafka.streams.processor.internals.StreamsAssignmentScaleTest > 
testHighAvailabilityTaskAssignorManyStandbys STARTED
[2022-07-07T04:15:20.192Z] 
[2022-07-07T04:15:20.192Z] 
org.apache.kafka.streams.integration.StoreQueryIntegrationTest > 
shouldQuerySpecificActivePartitionStores PASSED
[2022-07-07T04:15:20.192Z] 
[2022-07-07T04:15:20.192Z] 
org.apache.kafka.streams.integration.StoreQueryIntegrationTest > 
shouldQueryAllStalePartitionStores STARTED
[2022-07-07T04:15:25.904Z] 
[2022-07-07T04:15:25.904Z] 
org.apache.kafka.streams.integration.StoreQueryIntegrationTest > 
shouldQueryAllStalePartitionStores PASSED
[2022-07-07T04:15:25.904Z] 
[2022-07-07T04:15:25.904Z] 
org.apache.kafka.streams.integration.StoreQueryIntegrationTest > 
shouldQuerySpecificStalePartitionStoresMultiStreamThreads STARTED
[2022-07-07T04:15:30.651Z] 
[2022-07-07T04:15:30.651Z] 
org.apache.kafka.streams.integration.StoreQueryIntegrationTest > 
shouldQuerySpecificStalePartitionStoresMultiStreamThreads PASSED
[2022-07-07T04:15:30.651Z] 
[2022-07-07T04:15:30.652Z] 
org.apache.kafka.streams.integration.StoreQueryIntegrationTest > 
shouldQuerySpecificStalePartitionStores STARTED
[2022-07-07T04:15:36.359Z] 
[2022-07-07T04:15:36.359Z] 
org.apache.kafka.streams.integration.StoreQueryIntegrationTest > 
shouldQuerySpecificStalePartitionStores PASSED
[2022-07-07T04:15:36.359Z] 
[2022-07-07T04:15:36.359Z] 
org.apache.kafka.streams.integration.StoreQueryIntegrationTest > 
shouldQueryOnlyActivePartitionStoresByDefault STARTED
[2022-07-07T04:15:43.854Z] 
[2022-07-07T04:15:43.854Z] 
org.apache.kafka.streams.integration.StoreQueryIntegrationTest > 
shouldQueryOnlyActivePartitionStoresByDefault PASSED
[2022-07-07T04:15:43.854Z] 
[2022-07-07T04:15:43.854Z] 
org.apache.kafka.streams.integration.StoreQueryIntegrationTest > 
shouldQueryStoresAfterAddingAndRemovingStreamThread STARTED
[2022-07-07T04:15:55.002Z] 
[2022-07-07T04:15:55.002Z] 
org.apache.kafka.streams.integration.StoreQueryIntegrationTest > 
shouldQueryStoresAfterAddingAndRemovingStreamThread PASSED
[2022-07-07T04:15:55.002Z] 
[2022-07-07T04:15:55.002Z] 
org.apache.kafka.streams.integration.StoreQueryIntegrationTest > 
shouldQuerySpecificStalePartitionStoresMultiStreamThreadsNamedTopology STARTED
[2022-07-07T04:16:01.409Z] 
[2022-07-07T04:16:01.409Z] 
org.apache.kafka.streams.integration.StoreQueryIntegrationTest > 
shouldQuerySpecificStalePartitionStoresMultiStreamThreadsNamedTopology PASSED
[2022-07-07T04:16:01.409Z] 
[2022-07-07T04:16:01.409Z] 
org.apache.kafka.streams.integration.StreamStreamJoinIntegrationTest > 
testInnerRepartitioned[caching enabled = true] STARTED
[2022-07-07T04:16:04.388Z] 
[2022-07-07T04:16:04.388Z] 
org.apache.kafka.streams.processor.internals.StreamsAssignmentScaleTest > 
testHighAvailabilityTaskAssignorManyStandbys PASSED
[2022-07-07T04:16:04.388Z] 
[2022-07-07T04:16:04.388Z] 
org.apache.kafka.streams.processor.internals.StreamsAssignmentScaleTest > 
testFallbackPriorTaskAssignorLargeNumConsumers STARTED
[2022-07-07T04:16:07.329Z] 
[2022-07-07T04:16:07.329Z] 
org.apache.kafka.streams.processor.internals.StreamsAssignmentScaleTest > 
testFallbackPriorTaskAssignorLargeNumConsumers PASSED
[2022-07-07T04:16:07.329Z] 
[2022-07-07T04:16:07.329Z] 
org.apache.kafka.streams.processor.internals.StreamsAssignmentScaleTest > 
testStickyTaskAssignorLargeNumConsumers STARTED
[2022-07-07T04:16:09.945Z] 
[2022-07-07T04:16:09.945Z] 
org.apache.kafka.streams.processor.internals.StreamsAssignmentScaleTest > 
testStickyTaskAssignorLargeNumConsumers PASSED
[2022-07-07T04:16:10.958Z] streams-7: SMOKE-TEST-CLIENT-CLOSED
[2022-07-07T04:16:10.958Z] streams-5: SMOKE-TEST-CLIENT-CLOSED
[2022-07-07T04:16:10.958Z] streams-3: SMOKE-TEST-CLIENT-CLOSED
[2022-07-07T04:16:10.958Z] streams-0: SMOKE-TEST-CLIENT-CLOSED
[2022-07-07T04:16:10.958Z] streams-6: SMOKE-TEST-CLIENT-CLOSED
[2022-07-07T04:16:10.958Z] streams-4: SMOKE-TEST-CLIENT-CLOSED
[2022-07-07T04:16:10.958Z] streams-1: SMOKE-TEST-CLIENT-CLOSED
[2022-07-07T04:16:10.958Z] streams-2: SMOKE-TEST-CLIENT-CLOSED
[2022-07-07T04:16:13.565Z] 
[2022-07-07T04:16:13.565Z] 
org.apache.kafka.streams.integration.StreamStreamJoinIntegrationTest > 
testInnerRepartitioned[caching enabled = true] PASSED
[2022-07-07T04:16:13.565Z] 
[2022-07-07T04:16:13.565Z] 
org.apache.kafka.streams.integration.StreamStreamJoinIntegrationTest > 
testOuterRepartitioned[caching enabled = true] STARTED
[2022-07-07T04:16:29.984Z] 
[2022-07-07T04:16:29.984Z] 
org.apache.kafka.streams.integration.StreamStreamJoinIntegrationTest > 
testOuterRepartitioned[caching enabled = t