[jira] [Created] (KAFKA-16026) AsyncConsumer does not send a poll event to the background thread

2023-12-18 Thread Philip Nee (Jira)
Philip Nee created KAFKA-16026:
--

 Summary: AsyncConsumer does not send a poll event to the 
background thread
 Key: KAFKA-16026
 URL: https://issues.apache.org/jira/browse/KAFKA-16026
 Project: Kafka
  Issue Type: Bug
  Components: consumer
Reporter: Philip Nee


Consumer.poll() does not send a poll event to the background thread, which is 
needed to (see the sketch below):
 # trigger autocommit
 # reset the max poll interval timer
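
A minimal sketch of the missing hand-off, assuming a simple blocking queue between 
the application thread and the background thread (illustrative only, not the actual 
AsyncKafkaConsumer internals):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class PollEventSketch {
    // Marker event that the background thread consumes.
    record PollEvent(long pollTimeMs) {}

    private final BlockingQueue<PollEvent> backgroundQueue = new LinkedBlockingQueue<>();

    // Called from the application thread's poll().
    public void poll(long nowMs) {
        // Without this enqueue the background thread never learns that a poll happened,
        // so autocommit is not triggered and the max.poll.interval.ms timer is not reset.
        backgroundQueue.add(new PollEvent(nowMs));
        // ... return buffered records, etc.
    }

    public static void main(String[] args) {
        new PollEventSketch().poll(System.currentTimeMillis());
    }
}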

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16027) Refactor MetadataTest#testUpdatePartitionLeadership

2023-12-18 Thread Philip Nee (Jira)
Philip Nee created KAFKA-16027:
--

 Summary: Refactor MetadataTest#testUpdatePartitionLeadership
 Key: KAFKA-16027
 URL: https://issues.apache.org/jira/browse/KAFKA-16027
 Project: Kafka
  Issue Type: Improvement
Reporter: Philip Nee


MetadataTest#testUpdatePartitionLeadership is extremely long and is close to the 
160-line method limit. It also effectively contains two tests, so it is best to 
split it into two separate test methods.
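
An illustrative shape for the split (hypothetical method names and a stand-in for 
the metadata under test, not the real MetadataTest code):

import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

class UpdatePartitionLeadershipTest {
    // Minimal stand-in for the metadata under test.
    static class MetadataStub {
        private int currentEpoch = 1;
        boolean applyLeadership(int leaderEpoch) {
            if (leaderEpoch <= currentEpoch) return false; // stale update is ignored
            currentEpoch = leaderEpoch;
            return true;
        }
    }

    // Shared setup replaces the first half of the long test.
    private MetadataStub newMetadata() {
        return new MetadataStub();
    }

    @Test
    void shouldApplyNewerLeadershipInfo() {
        assertTrue(newMetadata().applyLeadership(2));
    }

    @Test
    void shouldIgnoreStaleLeadershipInfo() {
        assertFalse(newMetadata().applyLeadership(0));
    }
}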



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16028) AdminClient fails to describe consumer group

2023-12-18 Thread Jira
Ömer Şiar Baysal created KAFKA-16028:


 Summary: AdminClient fails to describe consumer group
 Key: KAFKA-16028
 URL: https://issues.apache.org/jira/browse/KAFKA-16028
 Project: Kafka
  Issue Type: Bug
  Components: admin, clients, consumer, log
Affects Versions: 3.6.1, 2.8.2
Reporter: Ömer Şiar Baysal


Dear Team,

We have been investigating some quirky behavior around the admin client. Here is 
our conclusion:

- Due to some bug (or a feature unknown to us), the AdminClient (both 2.8 and 3.6) 
fails to describe one of our consumer groups, even though the group has no known problems.
- A pure Go admin client (github.com/twmb/franz-go) does not have the problem 
and is able to describe the consumer group.

While trying to understand the cause, we first saw the Java client 2.8 report:


kafka-consumer-groups --bootstrap-server broker:9092 --describe --group 
'problematic-consumer'
Error: Executing consumer group command failed due to 
org.apache.kafka.common.errors.LeaderNotAvailableException: There is no leader 
for this topic-partition as we are in the middle of a leadership election.
java.util.concurrent.ExecutionException: 
org.apache.kafka.common.errors.LeaderNotAvailableException: There is no leader 
for this topic-partition as we are in the middle of a leadership election.

We waited to see whether this was a transient error, but it was not: there was 
no election in progress for the given topic.

It was not clear which topic the admin client was referring to, so we enabled 
TRACE logging, which revealed some more information:


[2023-12-18 10:36:38,434] DEBUG [AdminClient clientId=adminclient-1] Sending 
LIST_OFFSETS request with header RequestHeader(apiKey=LIST_OFFSETS, 
apiVersion=6, clientId=adminclient-1, correlationId=30) and timeout 4997 to 
node 40: ListOffsetsRequestData(replicaId=-1, isolationLevel=0, 
topics=[ListOffsetsTopic(name='problematic-topic', 
partitions=[ListOffsetsPartition(partitionIndex=4, currentLeaderEpoch=-1, 
timestamp=-1, maxNumOffsets=1), ListOffsetsPartition(partitionIndex=5, 
currentLeaderEpoch=-1, timestamp=-1, maxNumOffsets=1)])]) 
(org.apache.kafka.clients.NetworkClient)
[2023-12-18 10:36:38,434] TRACE [AdminClient clientId=adminclient-1] Entering 
KafkaClient#poll(timeout=4997) (org.apache.kafka.clients.admin.KafkaAdminClient)
[2023-12-18 10:36:38,435] TRACE [AdminClient clientId=adminclient-1] 
KafkaClient#poll retrieved 0 response(s) 
(org.apache.kafka.clients.admin.KafkaAdminClient)
[2023-12-18 10:36:38,435] TRACE [AdminClient clientId=adminclient-1] Trying to 
choose nodes for [] at 1702884998435 
(org.apache.kafka.clients.admin.KafkaAdminClient)
[2023-12-18 10:36:38,435] TRACE [AdminClient clientId=adminclient-1] Entering 
KafkaClient#poll(timeout=4995) (org.apache.kafka.clients.admin.KafkaAdminClient)

Error: Executing consumer group command failed due to 
org.apache.kafka.common.errors.LeaderNotAvailableException: There is no leader 
for this topic-partition as we are in the middle of a leadership election.
[2023-12-18 10:36:38,436] DEBUG [AdminClient clientId=adminclient-1] Received 
LIST_OFFSETS response from node 40 for request with header 
RequestHeader(apiKey=LIST_OFFSETS, apiVersion=6, clientId=adminclient-1, 
correlationId=30): ListOffsetsResponseData(throttleTimeMs=0, 
topics=[ListOffsetsTopicResponse(name='problematic-topic', 
partitions=[ListOffsetsPartitionResponse(partitionIndex=5, errorCode=0, 
oldStyleOffsets=[], timestamp=-1, offset=822516, leaderEpoch=113, 
followerRestorePointObjectId=AA, 
followerRestorePointEpoch=0), ListOffsetsPartitionResponse(partitionIndex=4, 
errorCode=0, oldStyleOffsets=[], timestamp=-1, offset=827297, leaderEpoch=93, 
followerRestorePointObjectId=AA, 
followerRestorePointEpoch=0)])]) (org.apache.kafka.clients.NetworkClient)
[2023-12-18 10:36:38,436] TRACE [AdminClient clientId=adminclient-1] 
KafkaClient#poll retrieved 1 response(s) 
(org.apache.kafka.clients.admin.KafkaAdminClient)
[2023-12-18 10:36:38,437] TRACE [AdminClient clientId=adminclient-1] 
Call(callName=listOffsets on broker 40, deadlineMs=1702885003430, tries=0, 
nextAllowedTryMs=0) got response ListOffsetsResponseData(throttleTimeMs=0, 
topics=[ListOffsetsTopicResponse(name='problematic-topic', 
partitions=[ListOffsetsPartitionResponse(partitionIndex=5, errorCode=0, 
oldStyleOffsets=[], timestamp=-1, offset=822516, leaderEpoch=113, 
followerRestorePointObjectId=AA, 
followerRestorePointEpoch=0), ListOffsetsPartitionResponse(partitionIndex=4, 
errorCode=0, oldStyleOffsets=[], timestamp=-1, offset=827297, leaderEpoch=93, 
followerRestorePointObjectId=AA, 
followerRestorePointEpoch=0)])]) 
(org.apache.kafka.clients.admin.KafkaAdminClient)
[2023-12-18 10:36:38,437] TRACE [AdminClient clientId=adminclient-1] Trying to 
choose nodes for [] at 1702884998436 
(org.apache.kafka.clients.admin.Kaf
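
For reference, the failure can be reproduced outside the CLI with the Java Admin 
client; the sketch below roughly mirrors what --describe does (fetch the group's 
committed offsets, then list end offsets to compute lag, which is the LIST_OFFSETS 
call that surfaces the exception). Bootstrap server and group name are the 
placeholders from the report:

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class DescribeGroupRepro {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        try (Admin admin = Admin.create(props)) {
            String group = "problematic-consumer";
            // Committed offsets for the group.
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets(group).partitionsToOffsetAndMetadata().get();
            // Log-end offsets for the same partitions; this is where the
            // LeaderNotAvailableException appears in the report.
            Map<TopicPartition, OffsetSpec> latest = new HashMap<>();
            committed.keySet().forEach(tp -> latest.put(tp, OffsetSpec.latest()));
            System.out.println(admin.listOffsets(latest).all().get());
        }
    }
}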

Re: [DISCUSS] Road to Kafka 4.0

2023-12-18 Thread Luke Chen
Hi all,

We found that currently (on the latest trunk branch), unclean leader
election is not supported in KRaft mode.
That is, when users enable `unclean.leader.election.enable` in KRaft mode,
the config doesn't take effect and the cluster behaves as if
`unclean.leader.election.enable` were disabled.
KAFKA-12670  was opened
for this and is still not resolved.

I think this is a regression in KRaft mode, and we should complete
this missing feature in a 3.x release instead of adding it in 4.0.
Does anyone know the status of this issue?

Thanks.
Luke



On Mon, Nov 27, 2023 at 4:38 PM Colin McCabe  wrote:

> On Fri, Nov 24, 2023, at 03:47, Anton Agestam wrote:
> > In your last message you wrote:
> >
> > > But, on the KRaft side, I still maintain that nothing is missing except
> > > JBOD, which we already have a plan for.
> >
> > But earlier in this thread you mentioned an issue with "torn writes",
> > possibly missing tests, as well as the fact that the recommended method
> of
> > replacing controller nodes is undocumented. Would you mind clarifying
> what
> > your stance is on these three issues? Do you think that they are
> important
> > enablers of upgrade paths or not?
>
> Hi Anton,
>
> There shouldn't be anything blocking controller disk replacement now. From
> memory (not looking at the code now), we do log recovery on our single log
> directory every time we start the controller, so it should handle partial
> records there. I do agree that a test would be good, and some
> documentation. I'll probably take a look at that this week if I get some
> time.
>
> > > Well, the line was drawn in KIP-833. If we redraw it, what is to stop
> us
> > > from redrawing it again and again?
> >
> > I'm fairly new to the Kafka community so please forgive me if I'm missing
> > things that have been said in earlier discussions, but reading up on that
> > KIP I see it has language like "Note: this timeline is very rough and
> > subject to change." in the section of versions, but it also says "As
> > outlined above, we expect to close these gaps soon" with relation to the
> > outstanding features. From my perspective this doesn't really look like
> an
> > agreement that dynamic quorum membership changes shall not be a blocker
> for
> > 4.0.
>
> The timeline was rough because we wrote that in 2022, trying to look
> forward multiple releases. The gaps that were discussed have all been
> closed -- except for JBOD, which we are working on this quarter.
>
> The set of features needed for 4.0 is very clearly described in KIP-833.
> There's no uncertainty on that point.
>
> >
> > To answer the specific question you pose here, "what is to stop us from
> > redrawing it again and again?", wouldn't the suggestion of parallel work
> > lanes brought up by Josep address this concern?
> >
>
> It's very important not to fragment the community by supporting multiple
> long-running branch lines. At the end of the day, once branch 3's time has
> come, it needs to fade away, just like JDK 6 support or the old Scala
> producer.
>
> best,
> Colin
>
>
> > BR,
> > Anton
> >
> > On Thu, Nov 23, 2023 at 05:48, Colin McCabe  wrote:
> >
> >> On Tue, Nov 21, 2023, at 19:30, Luke Chen wrote:
> >> > Yes, KIP-853 and disk failure support are both very important missing
> >> > features. For the disk failure support, I don't think this is a
> >> > "good-to-have-feature", it should be a "must-have" IMO. We can't
> announce
> >> > the 4.0 release without a good solution for disk failure in KRaft.
> >>
> >> Hi Luke,
> >>
> >> Thanks for the reply.
> >>
> >> Controller disk failure support is not missing from KRaft. I described
> how
> >> to handle controller disk failures earlier in this thread.
> >>
> >> I should note here that the broker in ZooKeeper mode also requires
> manual
> >> handling of disk failures. Restarting a broker with the same ID, but an
> >> empty disk, breaks the invariants of replication when in ZK mode.
> Consider:
> >>
> >> 1. Broker 1 goes down. A ZK state change notification for /brokers fires
> >> and goes on the controller queue.
> >>
> >> 2. Broker 1 comes back up with an empty disk.
> >>
> >> 3. The controller processes the zk state change notification for
> /brokers.
> >> Since broker 1 is up no action is taken.
> >>
> >> 4. Now broker 1 is in the ISR for any partitions it was in previously, but
> >> has no data. If it is or becomes leader for any partitions, irreversible
> >> data loss will occur.
> >>
> >> This problem is more than theoretical. We at Confluent have observed it
> in
> >> production and put in place special workarounds for the ZK clusters we
> >> still have.
> >>
> >> KRaft has never had this problem because brokers are removed from ISRs
> >> when a new incarnation of the broker registers.
> >>
> >> So perhaps ZK mode is not ready for production for Aiven? Since disk
> >> failures do in fact require special handling there. (And/or bringing up
> new
> >> nodes with empty

Jenkins build is unstable: Kafka » Kafka Branch Builder » 3.7 #16

2023-12-18 Thread Apache Jenkins Server
See 




[jira] [Created] (KAFKA-16029) Investigate cause of "Unable to find FetchSessionHandler for node X" in logs

2023-12-18 Thread Kirk True (Jira)
Kirk True created KAFKA-16029:
-

 Summary: Investigate cause of "Unable to find FetchSessionHandler 
for node X" in logs
 Key: KAFKA-16029
 URL: https://issues.apache.org/jira/browse/KAFKA-16029
 Project: Kafka
  Issue Type: Task
  Components: clients, consumer
Affects Versions: 3.7.0
Reporter: Kirk True
Assignee: Kirk True






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Road to Kafka 4.0

2023-12-18 Thread Justine Olshan
Hey Luke --

There were some previous discussions on the mailing list about this, but it
looks like we didn't file a ticket:
https://lists.apache.org/thread/sqsssos1d9whgmo92vdn81n9r5woy1wk

When I asked some of the folks who worked on KRaft about this, they
told me that it was intentional to make unclean leader election
a manual action.

I think that for folks who want to prioritize availability over
durability, the aggressive recovery strategy from KIP-966 should be
preferable to the old unclean leader election configuration.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas#KIP966:EligibleLeaderReplicas-Uncleanrecovery

Let me know if you don't think this is sufficient.

Justine
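
For anyone who needs the manual action today, here is a sketch of requesting an 
unclean election explicitly through the Admin API; I'm assuming this (or the 
equivalent kafka-leader-election.sh invocation) is the manual path being referred 
to, and topic, partition, and bootstrap server are placeholders:

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.ElectionType;
import org.apache.kafka.common.TopicPartition;

import java.util.Properties;
import java.util.Set;

public class ManualUncleanElection {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // Explicitly elect a leader for one partition, allowing an out-of-sync replica.
            admin.electLeaders(ElectionType.UNCLEAN, Set.of(new TopicPartition("my-topic", 0)))
                 .partitions()
                 .get()
                 .forEach((tp, error) -> System.out.println(tp + " -> " + error));
        }
    }
}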

On Mon, Dec 18, 2023 at 4:39 AM Luke Chen  wrote:

> Hi all,
>
> We found that currently (on the latest trunk branch), unclean leader
> election is not supported in KRaft mode.
> That is, when users enable `unclean.leader.election.enable` in KRaft mode,
> the config doesn't take effect and the cluster behaves as if
> `unclean.leader.election.enable` were disabled.
> KAFKA-12670  was opened
> for this and is still not resolved.
>
> I think this is a regression in KRaft mode, and we should complete
> this missing feature in a 3.x release instead of adding it in 4.0.
> Does anyone know the status of this issue?
>
> Thanks.
> Luke
>
>
>
> On Mon, Nov 27, 2023 at 4:38 PM Colin McCabe  wrote:
>
> > On Fri, Nov 24, 2023, at 03:47, Anton Agestam wrote:
> > > In your last message you wrote:
> > >
> > > > But, on the KRaft side, I still maintain that nothing is missing
> except
> > > > JBOD, which we already have a plan for.
> > >
> > > But earlier in this thread you mentioned an issue with "torn writes",
> > > possibly missing tests, as well as the fact that the recommended method
> > of
> > > replacing controller nodes is undocumented. Would you mind clarifying
> > what
> > > your stance is on these three issues? Do you think that they are
> > important
> > > enablers of upgrade paths or not?
> >
> > Hi Anton,
> >
> > There shouldn't be anything blocking controller disk replacement now.
> From
> > memory (not looking at the code now), we do log recovery on our single
> log
> > directory every time we start the controller, so it should handle partial
> > records there. I do agree that a test would be good, and some
> > documentation. I'll probably take a look at that this week if I get some
> > time.
> >
> > > > Well, the line was drawn in KIP-833. If we redraw it, what is to stop
> > us
> > > > from redrawing it again and again?
> > >
> > > I'm fairly new to the Kafka community so please forgive me if I'm
> missing
> > > things that have been said in earlier discussions, but reading up on
> that
> > > KIP I see it has language like "Note: this timeline is very rough and
> > > subject to change." in the section of versions, but it also says "As
> > > outlined above, we expect to close these gaps soon" with relation to
> the
> > > outstanding features. From my perspective this doesn't really look like
> > an
> > > agreement that dynamic quorum membership changes shall not be a blocker
> > for
> > > 4.0.
> >
> > The timeline was rough because we wrote that in 2022, trying to look
> > forward multiple releases. The gaps that were discussed have all been
> > closed -- except for JBOD, which we are working on this quarter.
> >
> > The set of features needed for 4.0 is very clearly described in KIP-833.
> > There's no uncertainty on that point.
> >
> > >
> > > To answer the specific question you pose here, "what is to stop us from
> > > redrawing it again and again?", wouldn't the suggestion of parallel
> work
> > > lanes brought up by Josep address this concern?
> > >
> >
> > It's very important not to fragment the community by supporting multiple
> > long-running branch lines. At the end of the day, once branch 3's time
> has
> > come, it needs to fade away, just like JDK 6 support or the old Scala
> > producer.
> >
> > best,
> > Colin
> >
> >
> > > BR,
> > > Anton
> > >
> > > On Thu, Nov 23, 2023 at 05:48, Colin McCabe  wrote:
> > >
> > >> On Tue, Nov 21, 2023, at 19:30, Luke Chen wrote:
> > >> > Yes, KIP-853 and disk failure support are both very important
> missing
> > >> > features. For the disk failure support, I don't think this is a
> > >> > "good-to-have-feature", it should be a "must-have" IMO. We can't
> > announce
> > >> > the 4.0 release without a good solution for disk failure in KRaft.
> > >>
> > >> Hi Luke,
> > >>
> > >> Thanks for the reply.
> > >>
> > >> Controller disk failure support is not missing from KRaft. I described
> > how
> > >> to handle controller disk failures earlier in this thread.
> > >>
> > >> I should note here that the broker in ZooKeeper mode also requires
> > manual
> > >> handling of disk failures. Restarting a broker with the same ID, but
> an
> > >> empty disk, breaks the invariants of replication when in ZK

RE: DISCUSS KIP-984 Add pluggable compression interface to Kafka

2023-12-18 Thread Diop, Assane
I would like to bring some attention to this KIP. We have added an interface to 
the compression code that allows anyone to build their own compression plugin 
and integrate it easily back into Kafka. 

Assane 

-Original Message-
From: Diop, Assane  
Sent: Monday, October 2, 2023 9:27 AM
To: dev@kafka.apache.org
Subject: DISCUSS KIP-984 Add pluggable compression interface to Kafka

https://cwiki.apache.org/confluence/display/KAFKA/KIP-984%3A+Add+pluggable+compression+interface+to+Kafka
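
For readers who haven't opened the KIP yet, a hypothetical sketch of what such a 
plugin SPI could look like; the interface and method names below are illustrative, 
not the ones actually defined in KIP-984:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public interface CompressionPlugin {
    // Codec name referenced from configuration, e.g. compression.type=my-codec.
    String name();

    // Wrap the raw record-batch output stream with a compressing stream.
    OutputStream wrapForOutput(OutputStream out) throws IOException;

    // Wrap a compressed input stream for decompression on the read path.
    InputStream wrapForInput(InputStream in) throws IOException;
}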


Jenkins build is still unstable: Kafka » Kafka Branch Builder » 3.7 #17

2023-12-18 Thread Apache Jenkins Server
See 




[jira] [Created] (KAFKA-16030) new group coordinator should check if partition goes offline during load

2023-12-18 Thread Jeff Kim (Jira)
Jeff Kim created KAFKA-16030:


 Summary: new group coordinator should check if partition goes 
offline during load
 Key: KAFKA-16030
 URL: https://issues.apache.org/jira/browse/KAFKA-16030
 Project: Kafka
  Issue Type: Sub-task
Reporter: Jeff Kim
Assignee: Jeff Kim


The new coordinator stops loading if the partition goes offline during load. 
However, the partition is still considered active. Instead, we should return a 
NOT_LEADER_OR_FOLLOWER error during load.
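
An illustrative sketch of the intended behavior (not the actual group-coordinator 
code; the helper methods are placeholders):

import org.apache.kafka.common.errors.NotLeaderOrFollowerException;

abstract class CoordinatorLoaderSketch {
    abstract boolean hasMoreBatches(int partition);  // placeholder
    abstract boolean isLeader(int partition);        // placeholder leadership check
    abstract void loadNextBatch(int partition);      // placeholder

    void loadPartition(int partition) {
        while (hasMoreBatches(partition)) {
            if (!isLeader(partition)) {
                // Fail the load instead of stopping quietly and leaving the
                // partition marked as active.
                throw new NotLeaderOrFollowerException(
                    "__consumer_offsets-" + partition + " went offline during load");
            }
            loadNextBatch(partition);
        }
    }
}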



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Requesting permission to contribute to Apache Kafka

2023-12-18 Thread Roman Bondar
Hi, Yash,
thank you so much!
I have now changed my display name to Latin script.
May I ask for advice on how to get to know the community? The
"getting-started" guide suggests introducing myself to the dev list, which
I definitely plan to do, I suppose in a new thread with an appropriate
subject. As for getting started, I could begin by helping to fix a minor
bug/issue. I would really appreciate any hint or comment on how to get
involved in the workflow.

Thank you in advance!
Sincerely,
Roman,
bond20...@gmail.com

On Fri, Dec 15, 2023 at 10:25 AM Yash Mayya  wrote:

> Hi Roman,
>
> I've granted the required permissions to your accounts.
>
> Cheers,
> Yash
>
> On Fri, Dec 15, 2023 at 12:12 PM Роман Бондарь 
> wrote:
>
> > Hi all,
> >
> > Please add me as a contributor to kafka project
> >
> > JIRA username: rbond
> > Wiki username: rbond
> > GitHub username: gitrbond
> >
> > Thank you,
> > Roman Bondar
> >
>


Re: [DISCUSS] KIP-939: Support Participation in 2PC

2023-12-18 Thread Artem Livshits
Hi Justine,

I've updated the KIP based on the KIP-890 updates.  Now KIP-939 only needs
to add one tagged field NextProducerEpoch as the other required fields will
be added as part of KIP-890.

> But here we could call the InitProducerId multiple times and we only want
the producer with the correct epoch to be able to commit the transaction

That's correct, the epoch cannot be inferred from the state in this case
because InitProducerId can be called multiple times.  I've also added an
example in the KIP that walks through the epoch overflow scenarios.

-Artem


On Thu, Dec 7, 2023 at 2:43 PM Justine Olshan 
wrote:

> Hey Artem,
>
> Thanks for the updates. I think what you say makes sense. I just updated my
> KIP so I want to reconcile some of the changes we made especially with
> respect to the TransactionLogValue.
>
> Firstly, I believe tagged fields require a default value so that if they
> are not filled, we return the default (and know that they were empty). For
> my KIP, I proposed the default for producer ID tagged fields should be -1.
> I was wondering if we could update the KIP to include the default values
> for producer ID and epoch.
>
> Next, I noticed we decided to rename the fields. I guess that the field
> "NextProducerId" in my KIP correlates to "ProducerId" in this KIP. Is that
> correct? So we would have "TransactionProducerId" for the non-tagged field
> and have "ProducerId" (NextProducerId) and "PrevProducerId" as tagged
> fields in the final version after KIP-890 and KIP-936 are implemented. Is this
> correct? I think the tags will need updating, but that is trivial.
>
> The final question I had was with respect to storing the new epoch. In
> KIP-890 part 2 (epoch bumps) I think we concluded that we don't need to
> store the epoch since we can interpret the previous epoch based on the
> producer ID. But here we could call the InitProducerId multiple times and
> we only want the producer with the correct epoch to be able to commit the
> transaction. Is that the correct reasoning for why we need the epoch here but
> not in the Prepare/Commit state?
>
> Thanks,
> Justine
>
> On Wed, Nov 22, 2023 at 9:48 AM Artem Livshits
>  wrote:
>
> > Hi Justine,
> >
> > After thinking a bit about supporting atomic dual writes for Kafka +
> NoSQL
> > database, I came to a conclusion that we do need to bump the epoch even
> > with InitProducerId(keepPreparedTxn=true).  As I described in my previous
> > email, we wouldn't need to bump the epoch to protect from zombies so that
> > reasoning is still true.  But we cannot protect from split-brain
> scenarios
> > when two or more instances of a producer with the same transactional id
> try
> > to produce at the same time.  The dual-write example for SQL databases (
> > https://github.com/apache/kafka/pull/14231/files) doesn't have a
> > split-brain problem because execution is protected by the update lock on
> > the transaction state record; however NoSQL databases may not have this
> > protection (I'll write an example for NoSQL database dual-write soon).
> >
> > In a nutshell, here is an example of a split-brain scenario:
> >
> >1. (instance1) InitProducerId(keepPreparedTxn=true), got epoch=42
> >2. (instance2) InitProducerId(keepPreparedTxn=true), got epoch=42
> >3. (instance1) CommitTxn, epoch bumped to 43
> >4. (instance2) CommitTxn, this is considered a retry, so it got epoch
> 43
> >as well
> >5. (instance1) Produce messageA w/sequence 1
> >6. (instance2) Produce messageB w/sequence 1, this is considered a
> >duplicate
> >7. (instance2) Produce messageC w/sequence 2
> >8. (instance1) Produce messageD w/sequence 2, this is considered a
> >duplicate
> >
> > Now if either of those commit the transaction, it would have a mix of
> > messages from the two instances (messageA and messageC).  With the proper
> > epoch bump, instance1 would get fenced at step 3.
> >
> > In order to update epoch in InitProducerId(keepPreparedTxn=true) we need
> to
> > preserve the ongoing transaction's epoch (and producerId, if the epoch
> > overflows), because we'd need to make a correct decision when we compare
> > the PreparedTxnState that we read from the database with the (producerId,
> > epoch) of the ongoing transaction.
> >
> > I've updated the KIP with the following:
> >
> >- Ongoing transaction now has 2 (producerId, epoch) pairs -- one pair
> >describes the ongoing transaction, the other pair describes expected
> > epoch
> >for operations on this transactional id
> >- InitProducerIdResponse now returns 2 (producerId, epoch) pairs
> >- TransactionalLogValue now has 2 (producerId, epoch) pairs, the new
> >values added as tagged fields, so it's easy to downgrade
> >- Added a note about downgrade in the Compatibility section
> >- Added a rejected alternative
> >
> > -Artem
> >
> > On Fri, Oct 6, 2023 at 5:16 PM Artem Livshits 
> > wrote:
> >
> > > Hi Justine,
> > >
> > > Thank you for the questions.  Current

Re: [DISCUSS] KIP-939: Support Participation in 2PC

2023-12-18 Thread Artem Livshits
Hi Jun,

Thank you for the comments.

> 10. For the two new fields in Enable2Pc and KeepPreparedTxn ...

I added a note that all combinations are valid.  Enable2Pc=false &
KeepPreparedTxn=true could potentially be useful for backward compatibility
with Flink, when a new version of Flink that implements KIP-319 tries to
work with a cluster that doesn't authorize 2PC.

> 11.  InitProducerIdResponse: If there is no ongoing txn, what will
OngoingTxnProducerId and OngoingTxnEpoch be set?

I added a note that they will be set to -1.  The client then will know that
there is no ongoing txn and .completeTransaction becomes a no-op (but still
required before .send is enabled).

> 12. ListTransactionsRequest related changes: It seems those are already
covered by KIP-994?

Removed from this KIP.

> 13. TransactionalLogValue ...

This is now updated to work on top of KIP-890.

> 14. "Note that the (producerId, epoch) pair that corresponds to the
ongoing transaction ...

This is now updated to work on top of KIP-890.

> 15. active-transaction-total-time-max : ...

Updated.

> 16. "transaction.two.phase.commit.enable The default would be ‘false’.
If it’s ‘false’, 2PC functionality is disabled even if the ACL is set ...

Disabling 2PC effectively removes all authorization to use it, hence I
thought TRANSACTIONAL_ID_AUTHORIZATION_FAILED would be appropriate.

Do you suggest using a different error code for 2PC authorization vs some
other authorization (e.g. TRANSACTIONAL_ID_2PC_AUTHORIZATION_FAILED) or a
different code for disabled vs. unauthorised (e.g.
TWO_PHASE_COMMIT_DISABLED) or both?

> 17. completeTransaction(). We expect this to be only used during recovery.

It can also be used if, say, a commit to the database fails and the result
is inconclusive, e.g.

1. Begin DB transaction
2. Begin Kafka transaction
3. Prepare Kafka transaction
4. Commit DB transaction
5. The DB commit fails, figure out the state of the transaction by reading
the PreparedTxnState from DB
6. Complete Kafka transaction with the PreparedTxnState.
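
A sketch of this flow in code. The producer methods involved (prepareTransaction,
completeTransaction, PreparedTxnState) are the ones proposed in KIP-939 and do not
exist in released clients, so they are modeled by stand-in interfaces here; the
database is likewise a placeholder:

public class DualWriteSketch {

    interface PreparedTxnState {}                     // stand-in for the proposed class

    interface TwoPcProducer {                         // stand-in for the proposed producer methods
        void beginTransaction();
        void send(String topic, byte[] value);
        PreparedTxnState prepareTransaction();        // proposed in KIP-939
        void commitTransaction();
        void completeTransaction(PreparedTxnState s); // proposed in KIP-939
    }

    interface Db {
        void begin();
        void save(PreparedTxnState state);            // persisted within the same DB transaction
        void commit();                                // may fail with an unknown outcome
        PreparedTxnState readPreparedState();
    }

    static void commitBoth(TwoPcProducer producer, Db db) {
        db.begin();                                             // 1. begin DB transaction
        producer.beginTransaction();                            // 2. begin Kafka transaction
        producer.send("events", new byte[0]);
        PreparedTxnState prepared = producer.prepareTransaction(); // 3. prepare Kafka transaction
        db.save(prepared);
        try {
            db.commit();                                        // 4. commit DB transaction
            producer.commitTransaction();
        } catch (RuntimeException inconclusive) {               // 5. outcome unknown
            // 6. complete the Kafka transaction from the state read back from the DB:
            //    it commits if the state matches the ongoing transaction, aborts otherwise.
            producer.completeTransaction(db.readPreparedState());
        }
    }
}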

> 18. "either prepareTransaction was called or initTransaction(true) was
called": "either" should be "neither"?

Updated.

> 19. Since InitProducerId always bumps up the epoch, it creates a
situation ...

InitProducerId only bumps the producer epoch; the ongoing transaction's epoch
stays the same, no matter how many times InitProducerId is called
before the transaction is completed.  Eventually the epoch may overflow,
and then a new producer id would be allocated, but the ongoing transaction's
producer id would stay the same.

I've added a couple examples in the KIP (
https://cwiki.apache.org/confluence/display/KAFKA/KIP-939%3A+Support+Participation+in+2PC#KIP939:SupportParticipationin2PC-PersistedDataFormatChanges)
that walk through some scenarios and show how the state is changed.

-Artem

On Fri, Dec 8, 2023 at 6:04 PM Jun Rao  wrote:

> Hi, Artem,
>
> Thanks for the KIP. A few comments below.
>
> 10. For the two new fields in Enable2Pc and KeepPreparedTxn in
> InitProducerId, it would be useful to document a bit more detail on what
> values are set under what cases. For example, are all four combinations
> valid?
>
> 11.  InitProducerIdResponse: If there is no ongoing txn, what will
> OngoingTxnProducerId and OngoingTxnEpoch be set?
>
> 12. ListTransactionsRequest related changes: It seems those are already
> covered by KIP-994?
>
> 13. TransactionalLogValue: Could we name TransactionProducerId and
> ProducerId better? It's not clear from the name which is for which.
>
> 14. "Note that the (producerId, epoch) pair that corresponds to the ongoing
> transaction is going to be written instead of the existing ProducerId and
> ProducerEpoch fields (which are renamed to reflect the semantics) to
> support downgrade.": I am a bit confused on that. Are we writing different
> values to the existing fields? Then, we can't downgrade, right?
>
> 15. active-transaction-total-time-max : Would
> active-transaction-open-time-max be more intuitive? Also, could we include
> the full name (group, tags, etc)?
>
> 16. "transaction.two.phase.commit.enable The default would be ‘false’.  If
> it’s ‘false’, 2PC functionality is disabled even if the ACL is set, clients
> that attempt to use this functionality would receive
> TRANSACTIONAL_ID_AUTHORIZATION_FAILED error."
> TRANSACTIONAL_ID_AUTHORIZATION_FAILED seems unintuitive for the client to
> understand what the actual cause is.
>
> 17. completeTransaction(). We expect this to be only used during recovery.
> Could we document this clearly? Could we prevent it from being used
> incorrectly (e.g. throw an exception if the producer has called other
> methods like send())?
>
> 18. "either prepareTransaction was called or initTransaction(true) was
> called": "either" should be "neither"?
>
> 19. Since InitProducerId always bumps up the epoch, it creates a situation
> where there could be multiple outstanding txns. The following is an example
> of a potential problem during

Jenkins build is unstable: Kafka » Kafka Branch Builder » trunk #2493

2023-12-18 Thread Apache Jenkins Server
See 




Jenkins build is still unstable: Kafka » Kafka Branch Builder » 3.7 #18

2023-12-18 Thread Apache Jenkins Server
See 




Build failed in Jenkins: Kafka » Kafka Branch Builder » trunk #2494

2023-12-18 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 348060 lines...]
Gradle Test Run :streams:test > Gradle Test Executor 91 > 
RocksDBMetricsRecorderTest > 
shouldThrowIfCacheToAddIsNotNullButExistingCacheIsNull() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 91 > 
RocksDBMetricsRecorderTest > 
shouldThrowIfCacheToAddIsNotNullButExistingCacheIsNull() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 91 > 
RocksDBMetricsRecorderTest > 
shouldThrowIfCacheToAddIsNullButExistingCacheIsNotNull() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 91 > 
RocksDBMetricsRecorderTest > 
shouldThrowIfCacheToAddIsNullButExistingCacheIsNotNull() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 91 > 
RocksDBMetricsRecorderTest > 
shouldThrowIfStatisticsToAddIsNotNullButExistingStatisticsAreNull() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 91 > 
RocksDBMetricsRecorderTest > 
shouldThrowIfStatisticsToAddIsNotNullButExistingStatisticsAreNull() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 91 > 
RocksDBMetricsRecorderTest > 
shouldThrowIfCacheToAddIsNotSameAsAllExistingCaches() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 91 > 
RocksDBMetricsRecorderTest > 
shouldThrowIfCacheToAddIsNotSameAsAllExistingCaches() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 91 > 
RocksDBMetricsRecorderTest > 
shouldNotRecordStatisticsBasedMetricsIfStatisticsIsNull() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 91 > 
RocksDBMetricsRecorderTest > 
shouldNotRecordStatisticsBasedMetricsIfStatisticsIsNull() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 91 > 
RocksDBMetricsRecorderTest > 
shouldSetStatsLevelToExceptDetailedTimersWhenValueProvidersWithStatisticsAreAdded()
 STARTED

Gradle Test Run :streams:test > Gradle Test Executor 91 > 
RocksDBMetricsRecorderTest > 
shouldSetStatsLevelToExceptDetailedTimersWhenValueProvidersWithStatisticsAreAdded()
 PASSED

Gradle Test Run :streams:test > Gradle Test Executor 91 > 
RocksDBMetricsRecorderTest > shouldRecordStatisticsBasedMetrics() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 91 > 
RocksDBMetricsRecorderTest > shouldRecordStatisticsBasedMetrics() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 91 > 
RocksDBMetricsRecorderTest > 
shouldThrowIfMetricRecorderIsReInitialisedWithDifferentStreamsMetrics() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 91 > 
RocksDBMetricsRecorderTest > 
shouldThrowIfMetricRecorderIsReInitialisedWithDifferentStreamsMetrics() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 91 > 
RocksDBMetricsRecorderTest > shouldInitMetricsRecorder() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 91 > 
RocksDBMetricsRecorderTest > shouldInitMetricsRecorder() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 91 > 
RocksDBMetricsRecorderTest > 
shouldThrowIfMetricRecorderIsReInitialisedWithDifferentTask() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 91 > 
RocksDBMetricsRecorderTest > 
shouldThrowIfMetricRecorderIsReInitialisedWithDifferentTask() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 91 > 
RocksDBMetricsRecorderTest > 
shouldCorrectlyHandleAvgRecordingsWithZeroSumAndCount() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 91 > 
RocksDBMetricsRecorderTest > 
shouldCorrectlyHandleAvgRecordingsWithZeroSumAndCount() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 91 > 
RocksDBMetricsRecorderTest > 
shouldThrowIfStatisticsToAddIsNullButExistingStatisticsAreNotNull() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 91 > 
RocksDBMetricsRecorderTest > 
shouldThrowIfStatisticsToAddIsNullButExistingStatisticsAreNotNull() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 91 > 
RocksDBMetricsRecorderTest > shouldNotAddItselfToRecordingTriggerWhenNotEmpty() 
STARTED

Gradle Test Run :streams:test > Gradle Test Executor 91 > 
RocksDBMetricsRecorderTest > shouldNotAddItselfToRecordingTriggerWhenNotEmpty() 
PASSED

streams-5: SMOKE-TEST-CLIENT-CLOSED
streams-1: SMOKE-TEST-CLIENT-CLOSED
streams-2: SMOKE-TEST-CLIENT-CLOSED
streams-5: SMOKE-TEST-CLIENT-CLOSED
streams-0: SMOKE-TEST-CLIENT-CLOSED
streams-6: SMOKE-TEST-CLIENT-CLOSED
streams-0: SMOKE-TEST-CLIENT-CLOSED
streams-4: SMOKE-TEST-CLIENT-CLOSED
streams-0: SMOKE-TEST-CLIENT-CLOSED
streams-1: SMOKE-TEST-CLIENT-CLOSED
streams-5: SMOKE-TEST-CLIENT-CLOSED
streams-4: SMOKE-TEST-CLIENT-CLOSED
streams-4: SMOKE-TEST-CLIENT-CLOSED
streams-6: SMOKE-TEST-CLIENT-CLOSED
streams-7: SMOKE-TEST-CLIENT-CLOSED
streams-1: SMOKE-TEST-CLIENT-CLOSED
streams-2: SMOKE-TEST-CLIENT-CLOSED
streams-3: SMOKE-TEST-CLIENT-CLOSED
streams-3: SMOKE-TEST-CLIENT-CLOSED
streams-7: SMOKE-TEST-CLIENT-CLOSED
streams-3: SMOKE-TEST-CLIENT-CLOSED
streams-2: SMOKE-TEST-CLIENT-CLO

[jira] [Created] (KAFKA-16031) Enabling testApplyDeltaShouldHandleReplicaAssignedToOnlineDirectory for tiered storage after supporting JBOD

2023-12-18 Thread Luke Chen (Jira)
Luke Chen created KAFKA-16031:
-

 Summary: Enabling 
testApplyDeltaShouldHandleReplicaAssignedToOnlineDirectory for tiered storage 
after supporting JBOD
 Key: KAFKA-16031
 URL: https://issues.apache.org/jira/browse/KAFKA-16031
 Project: Kafka
  Issue Type: Test
  Components: Tiered-Storage
Reporter: Luke Chen


Currently, tiered storage doesn't support JBOD (multiple log dirs). The test 
testApplyDeltaShouldHandleReplicaAssignedToOnlineDirectory requires multiple 
log dirs to run. We should enable it for tiered storage once JBOD is supported 
there.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Jenkins build is still unstable: Kafka » Kafka Branch Builder » 3.7 #19

2023-12-18 Thread Apache Jenkins Server
See 




DISCUSS KIP-1011: Use incrementalAlterConfigs when updating broker configs by kafka-configs.sh

2023-12-18 Thread ziming deng

Hello, I want to start a discussion on KIP-1011, which unifies the broker config 
change path with that of user/topic/client-metrics and avoids some bugs.

Here is the link: 

https://cwiki.apache.org/confluence/display/KAFKA/KIP-1011%3A+Use+incrementalAlterConfigs+when+updating+broker+configs+by+kafka-configs.sh

Best, 
Ziming.

Re: DISCUSS KIP-1011: Use incrementalAlterConfigs when updating broker configs by kafka-configs.sh

2023-12-18 Thread Ismael Juma
Thanks for the KIP. I think one of the main benefits of the change isn't
listed: sensitive configs cannot be updated with the current CLI tool
because sensitive config values are never returned.

Ismael
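
For illustration, the Admin API call that KIP-1011 proposes the tool switch to,
updating a single (possibly sensitive) broker config without reading the current
values first; bootstrap server, broker id, listener name, and the config key are
placeholders:

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class IncrementalBrokerConfigUpdate {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "1");
            AlterConfigOp setOp = new AlterConfigOp(
                new ConfigEntry("listener.name.external.ssl.keystore.password", "new-secret"),
                AlterConfigOp.OpType.SET);
            // Only the listed entries are touched; the existing (unreturnable) sensitive
            // values never need to be read back and re-submitted.
            admin.incrementalAlterConfigs(Map.of(broker, List.of(setOp))).all().get();
        }
    }
}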

On Mon, Dec 18, 2023 at 7:58 PM ziming deng 
wrote:

>
> Hello, I want to start a discussion on KIP-1011, to make the broker config
> change path unified with that of user/topic/client-metrics and avoid some
> bugs.
>
> Here is the link:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1011%3A+Use+incrementalAlterConfigs+when+updating+broker+configs+by+kafka-configs.sh
>
> Best,
> Ziming.
>


[jira] [Resolved] (KAFKA-15158) Add metrics for RemoteDeleteRequestsPerSec, RemoteDeleteErrorsPerSec, BuildRemoteLogAuxStateRequestsPerSec, BuildRemoteLogAuxStateErrorsPerSec

2023-12-18 Thread Luke Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Chen resolved KAFKA-15158.
---
Resolution: Fixed

> Add metrics for RemoteDeleteRequestsPerSec, RemoteDeleteErrorsPerSec, 
> BuildRemoteLogAuxStateRequestsPerSec, BuildRemoteLogAuxStateErrorsPerSec
> --
>
> Key: KAFKA-15158
> URL: https://issues.apache.org/jira/browse/KAFKA-15158
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Divij Vaidya
>Assignee: Gantigmaa Selenge
>Priority: Major
> Fix For: 3.7.0
>
>
> Add the following metrics for better observability into the RemoteLog related 
> activities inside the broker.
> 1. RemoteWriteRequestsPerSec
> 2. RemoteDeleteRequestsPerSec
> 3. BuildRemoteLogAuxStateRequestsPerSec
>  
> These metrics will be calculated at topic level (we can add them at 
> brokerTopicStats)
> *RemoteWriteRequestsPerSec* (struck out: already covered by KAFKA-14953) will be 
> marked on every call to RemoteLogManager#copyLogSegmentsToRemote()
>  
> *RemoteDeleteRequestsPerSec* will be marked on every call to 
> RemoteLogManager#cleanupExpiredRemoteLogSegments(). This method is introduced 
> in [https://github.com/apache/kafka/pull/13561] 
> *BuildRemoteLogAuxStateRequestsPerSec* will be marked on every call to 
> ReplicaFetcherTierStateMachine#buildRemoteLogAuxState()
>  
> (Note: For all the above, add Error metrics as well such as 
> RemoteDeleteErrorPerSec)
> (Note: This requires a change in KIP-405 and hence, must be approved by KIP 
> author [~satishd] )
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)