[jira] [Created] (KAFKA-17184) Remote index cache noisy logging

2024-07-23 Thread Francois Visconte (Jira)
Francois Visconte created KAFKA-17184:
-

 Summary: Remote index cache noisy logging
 Key: KAFKA-17184
 URL: https://issues.apache.org/jira/browse/KAFKA-17184
 Project: Kafka
  Issue Type: Bug
Reporter: Francois Visconte


We have a tiered storage cluster where some consumers are constantly lagging 
behind. 

On this cluster, we get a ton of error logs and failed fetches with the 
following symptom: 


{code:java}
java.lang.IllegalStateException: This entry is marked for cleanup
at 
org.apache.kafka.storage.internals.log.RemoteIndexCache$Entry.lookupOffset(RemoteIndexCache.java:569)
at 
org.apache.kafka.storage.internals.log.RemoteIndexCache.lookupOffset(RemoteIndexCache.java:446)
at 
kafka.log.remote.RemoteLogManager.lookupPositionForOffset(RemoteLogManager.java:1445)
at kafka.log.remote.RemoteLogManager.read(RemoteLogManager.java:1391)
at kafka.log.remote.RemoteLogReader.call(RemoteLogReader.java:62)
at kafka.log.remote.RemoteLogReader.call(RemoteLogReader.java:31)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
{code}

I believe this should be handled differently:

* The log level should be WARN or INFO rather than ERROR.
* We should reload the index when an offset is requested and the entry is 
marked for cleanup (see the sketch below). 


We do use the default setting for 
{{remote.log.index.file.cache.total.size.bytes}} (1GiB).
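
For the reload suggestion above, a minimal sketch of the idea. This is 
illustrative only: it assumes an slf4j {{log}} field, and the method names 
approximate the RemoteIndexCache API rather than quoting its exact internal 
signatures.

{code:java}
// Sketch only: method names approximate the RemoteIndexCache API.
int lookupPositionWithReload(RemoteIndexCache cache,
                             RemoteLogSegmentMetadata metadata,
                             long offset) {
    try {
        return cache.lookupOffset(metadata, offset);
    } catch (IllegalStateException e) {
        // The entry was marked for cleanup between retrieval and lookup.
        // Log at WARN instead of ERROR and retry once; the second call
        // loads a fresh entry into the cache instead of failing the fetch.
        log.warn("Index entry for segment {} marked for cleanup, reloading", metadata, e);
        return cache.lookupOffset(metadata, offset);
    }
}
{code}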







--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-1059: Enable the Producer flush() method to clear the latest send() error

2024-07-23 Thread Andrew Schofield
Hi,
I’m trying to understand the crux of the discussion here.

In systems I’m used to, it’s entirely allowed to attempt to add an operation to 
a transaction,
have that operation unambiguously fail, and continue with the transaction safe 
in the
knowledge that the failed operation didn’t make it, but the transaction is 
intact. The
“unambiguously” part of this is key. If the application doesn’t really know 
what’s in the
transaction, it’s not safe to commit it. It is of course always safe to roll it 
back.

So, I think the core question is whether this KIP can achieve this for Kafka for
Producer.send() operations. Personally, I’m pretty happy with the Producer 
putting
the transaction into error state unless it is safely handling a small number of 
errors
for situations in which it’s sure that the transaction is intact.

I’m comfortable with an additional method, such as prepare(), so that the 
current
behaviour of send() is retained wherein a failed operation always puts the
transaction into an error state so it cannot be committed. This is probably more
robust in terms of unambiguous behaviour than overloading send() or flush().
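
To make the shape of that concrete, here is a hypothetical sketch. prepare(), 
PreparedRecord and RecordPreparationException do not exist in today's Producer 
API; the names are purely illustrative of the idea.

producer.beginTransaction();
try {
    // prepare() validates and serializes the record up front; a failure here
    // surfaces directly and, crucially, does not poison the transaction.
    PreparedRecord prepared = producer.prepare(record);
    producer.send(prepared);
} catch (RecordPreparationException e) {
    // The failed operation unambiguously did not enter the transaction,
    // so it remains safe to commit the other sends.
}
producer.commitTransaction();

With send() retaining its current behaviour, applications that never call
prepare() would be unaffected.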

Thanks,
Andrew

> On 23 Jul 2024, at 00:04, Matthias J. Sax  wrote:
>
> Thanks Greg,
>
> Apologize if we gave the impression that we did not try to address your 
> concerns. It was not my intention to just minimize them. Of course, I just 
> don't see it your way, but that's ok. We don't have to agree. That's why we 
> have a DISCUSS thread to begin with.
>
> I don't know the producer internals well enough to judge how easy/difficult 
> it would be to implement a `prepare(..)` method as proposed. If it works, I 
> won't object to go down this route. Happy to disagree and commit.
>
>
> I just don't see why `prepare()` is semantically cleaner/better as you say, 
> compared to a `send()` which would throw an exception directly? But this 
> might be personal preference / judgment, and it might not benefit this KIP to 
> argue about it further.
>
>
> I am wondering about interleaved calls of `prepare()` and regular `send()` 
> though, especially given that the producer is thread-safe. Maybe there is 
> nothing to worry about (as said, I don't know the producer internals well 
> enough), but if this would cause issues, it might not be the best way forward.
>
>
> In the end it's Alieh's KIP, and it seems adding `prepare(...)` will enlarge 
> the scope of the KIP. So it's her call if she wants to go down this path or 
> not.
>
>
> -Matthias
>
>
> On 7/22/24 12:30 PM, Greg Harris wrote:
>> Hi Alieh,
>> Yes, I think you understand my intent for the prepare() method.
>> Thanks,
>> Greg
>> On Mon, Jul 22, 2024 at 2:54 AM Alieh Saeedi 
>> wrote:
>>> Hi Greg,
>>>
>>>
>>> I appreciate your concerns and comprehensive answer.
>>>
>>>
>>> I am not sure whether I fully understood what you meant or not. You mean,
>>> at the end, the user can go for one of the following scenarios: Either
>>>
>>> 1) `beginTxn()` and `send(record)` and `commitTxn()`  or
>>>
>>> 2) `beginTxn()` and `prepare(record)` and `send(prepared_record)` and
>>> `commitTxn()` ?
>>>
>>>
>>> Of course, the `send` in scenario 1 is different from the one in scenario
>>> 2, since part of the second one's job has already been done during
>>> `prepare()`.
>>>
>>>
>>> Cheers,
>>>
>>> Alieh
>>>
>>> On Sat, Jul 20, 2024 at 1:20 AM Greg Harris 
>>> wrote:
>>>
 Hi Artem and Matthias,

> On the other hand, the effort to prove that
> keeping all records in memory won't break some scenarios (and generally
> breaking one is enough to cause a lot of pain) seems to be significantly
> higher than to prove that setting some flag in some API has pretty much 0
> chance of regression

> in the end, why buffer records twice?

> This way we don't
> ignore the error, we're just changing the method they are delivered.

> Very clean semantics
> which should also address the concern of "non-atomic tx"

 I feel like my concerns are being minimized instead of being addressed
 in this discussion, and if that's because I'm not expressing them clearly,
 I apologize.

 Many users come to Kafka with prior expectations, especially when we use
 industry-standard terminology like 'Exactly Once Semantics",
 "Transactions", "Commit", "Abort". Of course Kafka isn't an ACID-compliant
 database, but users will evaluate, discuss, and develop applications with
 Kafka through the lens of the ACID principles, because that is the
 framework most commonly applied to transactional semantics.
 The original design of KIP-98 [1] explicitly mentions atomic commits (with
 the same meaning as the A in ACID) as the primary abstraction being added
 (reproduced here):

> At the core, transactional guarantees enable applications to produce to
 multiple TopicPartitions atomically, i.e. all writes to these
 TopicPartitions will succeed or fail as a unit.

Re: [DISCUSS] KIP-1023: Follower fetch from tiered offset

2024-07-23 Thread Abhijeet Kumar
Hi Divij,

Seems like there is some confusion about the new protocol for fetching from
tiered offset.
The scenario you are highlighting is where,
Leader's Log Start Offset = 0
Last Tiered Offset = 10

Following is the sequence of events that will happen:

1. Follower requests offset 0 from the leader
2. Assuming offset 0 is not available locally (to arrive at your scenario),
Leader returns OffsetMovedToTieredStorageException
3. Follower fetches the earliest pending upload offset and receives 11
4. Follower builds aux state from [0-10] and sets the fetch offset to 11
(This step corresponds to step 3 in your email)

At this stage, even if the leader has uploaded more data and the
last-tiered offset has changed (say to 15), it will not matter
because offset 11 should still be available on the leader and when the
follower requests data with fetch offset 11, the leader
will return with a valid partition data response which the follower can
consume and proceed further. Even if the offset 11 is not
available anymore, the follower will eventually be able to catch up with
the leader by resetting its fetch offset until the offset
is available on the leader's local log. Once it catches up, replication on
the follower can proceed.
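
Expressed as a rough follower-side loop (illustrative only; the helper names
below are hypothetical, not actual replica fetcher code):

long fetchOffset = 0L;
while (running) {
    try {
        FetchData data = fetchFromLeader(fetchOffset);   // hypothetical helper
        appendToLocalLog(data);
        fetchOffset = data.nextOffset();
    } catch (OffsetMovedToTieredStorageException e) {
        // The leader no longer has fetchOffset locally: find where its local
        // log begins, rebuild aux state for the tiered range below it, and
        // resume fetching from the first locally available offset.
        long earliestPendingUpload = fetchEarliestPendingUploadOffset();
        buildAuxState(fetchOffset, earliestPendingUpload - 1);
        fetchOffset = earliestPendingUpload;
    }
}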

Regards,
Abhijeet.



On Tue, Jul 2, 2024 at 3:55 PM Divij Vaidya  wrote:

> Hi folks.
>
> I am late to the party but I have a question on the proposal.
>
> How are we preventing a situation such as the following:
>
> 1. Empty follower asks leader for 0
> 2. Leader compares 0 with last-tiered-offset, and responds with 11 (where 10
> is last-tiered-offset) and an OffsetMovedToTieredException
> 3. Follower builds aux state from [0-10] and sets the fetch offset to 11
> 4. But leader has already uploaded more data and now the new
> last-tiered-offset is 15
> 5. Go back to 2
>
> This could cause a cycle where the replica will be stuck trying to
> reconcile with the leader.
>
> --
> Divij Vaidya
>
>
>
> On Fri, Apr 26, 2024 at 7:28 AM Abhijeet Kumar  >
> wrote:
>
> > Thank you all for your comments. As all the comments in the thread are
> > addressed, I am starting a Vote thread for the KIP. Please have a look.
> >
> > Regards.
> > Abhijeet.
> >
> >
> >
> > On Thu, Apr 25, 2024 at 6:08 PM Luke Chen  wrote:
> >
> > > Hi, Abhijeet,
> > >
> > > Thanks for the update.
> > >
> > > I have no more comments.
> > >
> > > Luke
> > >
> > > On Thu, Apr 25, 2024 at 4:21 AM Jun Rao 
> > wrote:
> > >
> > > > Hi, Abhijeet,
> > > >
> > > > Thanks for the updated KIP. It looks good to me.
> > > >
> > > > Jun
> > > >
> > > > On Mon, Apr 22, 2024 at 12:08 PM Abhijeet Kumar <
> > > > abhijeet.cse@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Jun,
> > > > >
> > > > > Please find my comments inline.
> > > > >
> > > > >
> > > > > On Thu, Apr 18, 2024 at 3:26 AM Jun Rao 
> > > > wrote:
> > > > >
> > > > > > Hi, Abhijeet,
> > > > > >
> > > > > > Thanks for the reply.
> > > > > >
> > > > > > 1. I am wondering if we could achieve the same result by just
> > > lowering
> > > > > > local.retention.ms and local.retention.bytes. This also allows
> the
> > > > newly
> > > > > > started follower to build up the local data before serving the
> > > consumer
> > > > > > traffic.
> > > > > >
> > > > >
> > > > > I am not sure I fully followed this. Do you mean we could lower the
> > > > > local.retention (by size and time)
> > > > > so that there is little data on the leader's local storage so that
> > the
> > > > > follower can quickly catch up with the leader?
> > > > >
> > > > > In that case, we will need to set small local retention across
> > brokers
> > > in
> > > > > the cluster. It will have the undesired
> > > > > effect where there will be increased remote log fetches for serving
> > > > consume
> > > > > requests, and this can cause
> > > > > degradations. Also, this behaviour (of increased remote fetches)
> will
> > > > > happen on all brokers at all times, whereas in
> > > > > the KIP we are restricting the behavior only to the newly
> > bootstrapped
> > > > > brokers and only until the time it fully builds
> > > > > the local logs as per retention defined at the cluster level.
> > > > > (Deprioritization of the broker could help reduce the impact
> > > > >  even further)
> > > > >
> > > > >
> > > > > >
> > > > > > 2. Have you updated the KIP?
> > > > > >
> > > > >
> > > > > The KIP has been updated now.
> > > > >
> > > > >
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Tue, Apr 9, 2024 at 3:36 AM Satish Duggana <
> > > > satish.dugg...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > +1 to Jun for adding the consumer fetching from a follower
> > scenario
> > > > > > > also to the existing section that talked about the drawback
> when
> > a
> > > > > > > node built with last-tiered-offset has become a leader. As
> > Abhijeet
> > > > > > > mentioned, we plan to have a follow-up KIP that will address
> > these
> > > by
> > > > > > > having a deprioritization of these brokers.

[VOTE] 3.8.0 RC3

2024-07-23 Thread Josep Prat
Hello Kafka users, developers and client-developers,

This is the fourth candidate for release of Apache Kafka 3.8.0.

Some of the major features included in this release are:
* KIP-1028: Docker Official Image for Apache Kafka
* KIP-974: Docker Image for GraalVM based Native Kafka Broker
* KIP-1036: Extend RecordDeserializationException exception
* KIP-1019: Expose method to determine Metric Measurability
* KIP-1004: Enforce tasks.max property in Kafka Connect
* KIP-989: Improved StateStore Iterator metrics for detecting leaks
* KIP-993: Allow restricting files accessed by File and Directory
ConfigProviders
* KIP-924: customizable task assignment for Streams
* KIP-813: Shareable State Stores
* KIP-719: Deprecate Log4J Appender
* KIP-390: Support Compression Level
* KIP-1018: Introduce max remote fetch timeout config for
DelayedRemoteFetch requests
* KIP-1037: Allow WriteTxnMarkers API with Alter Cluster Permission
* KIP-1047 Introduce new org.apache.kafka.tools.api.Decoder to replace
kafka.serializer.Decoder
* KIP-899: Allow producer and consumer clients to rebootstrap

Release notes for the 3.8.0 release:
https://home.apache.org/~jlprat/kafka-3.8.0-rc3/RELEASE_NOTES.html

Please download, test and vote by Friday, July 26, 9am PT


Kafka's KEYS file containing PGP keys we use to sign the release:
https://kafka.apache.org/KEYS

* Release artifacts to be voted upon (source and binary):
https://home.apache.org/~jlprat/kafka-3.8.0-rc3/

* Docker release artifact to be voted upon:
apache/kafka:3.8.0-rc3
apache/kafka-native:3.8.0-rc3

* Maven artifacts to be voted upon:
https://repository.apache.org/content/groups/staging/org/apache/kafka/

* Javadoc:
https://home.apache.org/~jlprat/kafka-3.8.0-rc3/javadoc/

* Tag to be voted upon (off 3.8 branch) is the 3.8.0 tag:
https://github.com/apache/kafka/releases/tag/3.8.0-rc3

* Documentation:
https://kafka.apache.org/38/documentation.html

* Protocol:
https://kafka.apache.org/38/protocol.html

* Successful Jenkins builds for the 3.8 branch:
Unit/integration tests:
https://ci-builds.apache.org/job/Kafka/job/kafka/job/3.8/73/
System tests: Still running

* Successful Docker Image Github Actions Pipeline for 3.8 branch:
Docker Build Test Pipeline (JVM):
https://github.com/apache/kafka/actions/runs/10055827182
Docker Build Test Pipeline (Native):
https://github.com/apache/kafka/actions/runs/10055829295


Thanks,

-- 
*Josep Prat*
Open Source Engineering Director, *Aiven*
josep.p...@aiven.io   |   +491715557497
aiven.io
*Aiven Deutschland GmbH*
Alexanderufer 3-7, 10117 Berlin
Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
Amtsgericht Charlottenburg, HRB 209739 B


[jira] [Created] (KAFKA-17185) Make sure a single logger instance is created

2024-07-23 Thread Chia-Ping Tsai (Jira)
Chia-Ping Tsai created KAFKA-17185:
--

 Summary: Make sure a single logger instance is created 
 Key: KAFKA-17185
 URL: https://issues.apache.org/jira/browse/KAFKA-17185
 Project: Kafka
  Issue Type: Improvement
Reporter: Chia-Ping Tsai
Assignee: Chia-Ping Tsai


the discussion: 
https://github.com/apache/kafka/pull/16657#discussion_r1686938593

In short, "private final logger" -> "private static final logger"
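
For illustration, in an arbitrary class:

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class Example {
    // Before: every Example instance holds its own logger reference.
    // private final Logger log = LoggerFactory.getLogger(Example.class);

    // After: a single logger instance is shared by all instances.
    private static final Logger LOG = LoggerFactory.getLogger(Example.class);
}
{code}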



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] Add blog post for 3.8 release [kafka-site]

2024-07-23 Thread via GitHub


stanislavkozlovski commented on code in PR #614:
URL: https://github.com/apache/kafka-site/pull/614#discussion_r1687822281


##
blog.html:
##
@@ -22,6 +22,94 @@
 
 
 Blog
+
+
+
+Apache 
Kafka 3.8.0 Release Announcement
+
+xx July 2024 - Josep Prat (https://twitter.com/jlprat";>@jlprat)
+We are proud to announce the release of Apache Kafka 3.8.0. 
This release contains many new features and improvements. This blog post will 
highlight some of the more prominent features. For a full list of changes, be 
sure to check the https://downloads.apache.org/kafka/3.8.0/RELEASE_NOTES.html";>release 
notes.
+See the https://kafka.apache.org/documentation.html#upgrade_3_8_0";>Upgrading to 
3.8.0 from any version 0.8.x through 3.6.x section in the documentation for 
the list of notable changes and detailed upgrade steps.
+
+In the last release, 3.6,
+https://kafka.apache.org/38/documentation.html#tiered_storage";>tiered 
storage was released as early availability feature.
+In this release, Tiered Storage now supports clusters 
configured with multiple log directories (i.e. JBOD feature).

Review Comment:
   @soarez - that's my understanding too



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (KAFKA-17186) Cannot receive message after stopping Source Mirror Maker 2

2024-07-23 Thread George Yang (Jira)
George Yang created KAFKA-17186:
---

 Summary: Cannot receive message after stopping Source Mirror Maker 
2
 Key: KAFKA-17186
 URL: https://issues.apache.org/jira/browse/KAFKA-17186
 Project: Kafka
  Issue Type: Bug
  Components: mirrormaker
Affects Versions: 3.7.1
 Environment: Source Kafka Cluster per Node:
CPU(s): 32
Memory: 32G/1.1G free

Target Kafka Cluster standalone Node:
CPU(s): 24
Memory: 30G/909M free

Kafka Version 3.7
Mirrormaker Version 3.7.1

Reporter: George Yang


Deploy nodes 1, 2, and 3 in Data Center A, with MM2 service deployed on node 1. 
Deploy node 1 in Data Center B, with MM2 service also deployed on node 1. 
Currently, a service on node 1 in Data Center A acts as a producer sending 
messages to the `myTest` topic. A service in Data Center B acts as a consumer 
listening to `A.myTest`. 

The issue arises when MM2 on node 1 in Data Center A is stopped: the consumer 
in Data Center B ceases to receive messages. Even after I restart MM2 in 
Data Center A, the consumer in Data Center B still does not receive messages 
until approximately 5 minutes later when a rebalance occurs, at which point it 
begins receiving messages again.

 

[Logs From Consumer on Data Center B]

```log

[2024-07-23 17:29:17,270] INFO [MirrorCheckpointConnector|worker] refreshing 
consumer groups took 185 ms (org.apache.kafka.connect.mirror.Scheduler:95)
[2024-07-23 17:29:19,189] INFO [MirrorCheckpointConnector|worker] refreshing 
consumer groups took 365 ms (org.apache.kafka.connect.mirror.Scheduler:95)
[2024-07-23 17:29:22,271] INFO [MirrorCheckpointConnector|worker] refreshing 
consumer groups took 186 ms (org.apache.kafka.connect.mirror.Scheduler:95)
[2024-07-23 17:29:24,193] INFO [MirrorCheckpointConnector|worker] refreshing 
consumer groups took 369 ms (org.apache.kafka.connect.mirror.Scheduler:95)
[2024-07-23 17:29:25,377] INFO [Worker clientId=B->A, groupId=B-mm2] Rebalance 
started (org.apache.kafka.connect.runtime.distributed.WorkerCoordinator:242)
[2024-07-23 17:29:25,377] INFO [Worker clientId=B->A, groupId=B-mm2] 
(Re-)joining group 
(org.apache.kafka.connect.runtime.distributed.WorkerCoordinator:604)
[2024-07-23 17:29:25,386] INFO [Worker clientId=B->A, groupId=B-mm2] 
Successfully joined group with generation Generation{generationId=52, 
memberId='B->A-adc19038-a8b6-40fb-9bf6-249f866944ab', protocol='sessioned'} 
(org.apache.kafka.connect.runtime.distributed.WorkerCoordinator:665)
[2024-07-23 17:29:25,390] INFO [Worker clientId=B->A, groupId=B-mm2] 
Successfully synced group in generation Generation{generationId=52, 
memberId='B->A-adc19038-a8b6-40fb-9bf6-249f866944ab', protocol='sessioned'} 
(org.apache.kafka.connect.runtime.distributed.WorkerCoordinator:842)
[2024-07-23 17:29:25,390] INFO [Worker clientId=B->A, groupId=B-mm2] Joined 
group at generation 52 with protocol version 2 and got assignment: 
Assignment{error=0, leader='B->A-adc19038-a8b6-40fb-9bf6-249f866944ab', 
leaderUrl='NOTUSED', offset=1360, connectorIds=[MirrorCheckpointConnector], 
taskIds=[MirrorCheckpointConnector-0, MirrorCheckpointConnector-1, 
MirrorCheckpointConnector-2], revokedConnectorIds=[], revokedTaskIds=[], 
delay=0} with rebalance delay: 0 
(org.apache.kafka.connect.runtime.distributed.DistributedHerder:2580)
[2024-07-23 17:29:25,390] INFO [Worker clientId=B->A, groupId=B-mm2] Starting 
connectors and tasks using config offset 1360 
(org.apache.kafka.connect.runtime.distributed.DistributedHerder:1921)
[2024-07-23 17:29:25,390] INFO [Worker clientId=B->A, groupId=B-mm2] Finished 
starting connectors and tasks 
(org.apache.kafka.connect.runtime.distributed.DistributedHerder:1950)
[2024-07-23 17:29:26,883] INFO [Worker clientId=A->B, groupId=A-mm2] Rebalance 
started (org.apache.kafka.connect.runtime.distributed.WorkerCoordinator:242)
[2024-07-23 17:29:26,883] INFO [Worker clientId=A->B, groupId=A-mm2] 
(Re-)joining group 
(org.apache.kafka.connect.runtime.distributed.WorkerCoordinator:604)
[2024-07-23 17:29:26,890] INFO [Worker clientId=A->B, groupId=A-mm2] 
Successfully joined group with generation Generation{generationId=143, 
memberId='A->B-0d04e6c1-f12a-4121-89af-e9992a167a01', protocol='sessioned'} 
(org.apache.kafka.connect.runtime.distributed.WorkerCoordinator:665)
[2024-07-23 17:29:26,893] INFO [Worker clientId=A->B, groupId=A-mm2] 
Successfully synced group in generation Generation{generationId=143, 
memberId='A->B-0d04e6c1-f12a-4121-89af-e9992a167a01', protocol='sessioned'} 
(org.apache.kafka.connect.runtime.distributed.WorkerCoordinator:842)


```

[Configuration]

```properties

# Run with ./bin/connect-mirror-maker.sh connect-mirror-maker.properties

name=MCS-MM2
# specify any number of cluster aliases
clusters = A, B

# connection information for each cluster
# This is a comma separated host:port pairs for each cluster
# for e.g. "A_host1:9092, A_host2:9092, A_host3:90
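
The roughly five-minute recovery delay matches Connect's default scheduled
rebalance delay: scheduled.rebalance.max.delay.ms defaults to 300000 ms
(5 minutes), which is how long a departed worker's tasks stay unassigned
before being rebalanced. If that is the cause, lowering it in the MM2 worker
properties, e.g.

scheduled.rebalance.max.delay.ms = 60000

would shrink the window at the cost of more eager reassignment. This is an
illustrative, untested tweak, not a confirmed fix.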

[jira] [Created] (KAFKA-17187) Add executor-AlterAcls to ClusterTestExtensions#SKIPPED_THREAD_PREFIX

2024-07-23 Thread PoAn Yang (Jira)
PoAn Yang created KAFKA-17187:
-

 Summary: Add executor-AlterAcls to 
ClusterTestExtensions#SKIPPED_THREAD_PREFIX
 Key: KAFKA-17187
 URL: https://issues.apache.org/jira/browse/KAFKA-17187
 Project: Kafka
  Issue Type: Sub-task
Reporter: PoAn Yang
Assignee: PoAn Yang


In CI [0], there is an `executor-AlterAcls` thread leak. We should add it to 
ClusterTestExtensions#SKIPPED_THREAD_PREFIX [1] as we did in TestUtils#
verifyNoUnexpectedThreads [2].
 
[0] 
[https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-16661/1/testReport/]
[1] 
[https://github.com/apache/kafka/blob/a5bfc2190c3448039c9361909e547f64f7fdb6e2/core/src/test/java/kafka/test/junit/ClusterTestExtensions.java#L102]
[2] 
[https://github.com/apache/kafka/blob/a5bfc2190c3448039c9361909e547f64f7fdb6e2/core/src/test/scala/unit/kafka/utils/TestUtils.scala#L1918-L1927]
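
The change itself should be a one-line addition to the prefix set; sketched
below with a placeholder neighboring entry, since the field's actual contents
may differ:

{code:java}
// In ClusterTestExtensions: extend the skip list with the leaked prefix.
private static final Set<String> SKIPPED_THREAD_PREFIX = Set.of(
        "controller-event-thread",   // placeholder for the existing entries
        "executor-AlterAcls"         // added: skip the AlterAcls executor thread
);
{code}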



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-1023: Follower fetch from tiered offset

2024-07-23 Thread Divij Vaidya
Thank you for your response Abhijeet. You have understood the scenario
correctly. For the purpose of discussion, please consider the latter case
where offset 11 is not available on the leader anymore (it got cleaned
locally since the last tiered offset is 15). In such a case, you
mentioned, the follower will eventually be able to catch up with the leader
by resetting its fetch offset until the offset is available on the leader's
local log. Correct me if I am wrong but it is not guaranteed that it will
eventually catch up because, theoretically, every time it asks for a newer 
fetch offset, the leader may have deleted it locally. I understand that it
is an edge case scenario which will only happen with configurations for
small segment sizes and aggressive cleaning but nevertheless, it is a
possible scenario.

Do you agree that theoretically it is possible for the follower to loop
such that it is never able to catch up?

We can proceed with the KIP with an understanding that this scenario is
rare and we are willing to accept the risk of it. In such a case, we should
add a detection mechanism for such a scenario in the KIP, so that if we
encounter this scenario, the user has a way to detect (and mitigate) it.
Alternatively, we can change the KIP design to ensure that we never
encounter this scenario. Given the rarity of the scenario, I am ok with
having a detection mechanism (metric?) in place and having this scenario
documented as an acceptable risk in the current design.
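
If we go the detection route, a sketch of what such a metric could look like
with Kafka's Metrics API is below; the sensor and metric names are invented
for illustration, not taken from the KIP.

import java.util.Collections;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.CumulativeCount;

// Invented names, for illustration only.
class AuxStateRebuildMetric {
    private final Sensor rebuilds;

    AuxStateRebuildMetric(Metrics metrics) {
        rebuilds = metrics.sensor("aux-state-rebuilds");
        rebuilds.add(new MetricName("aux-state-rebuild-total", "replica-fetcher-metrics",
                "Times a follower restarted aux-state reconciliation because the "
                        + "leader's local log start offset moved again",
                Collections.emptyMap()),
                new CumulativeCount());
    }

    // Record each time the follower loops back to rebuild aux state; a count
    // that keeps growing for the same partition signals the cycle described above.
    void onAuxStateRebuild() {
        rebuilds.record();
    }
}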

--
Divij Vaidya



On Tue, Jul 23, 2024 at 11:55 AM Abhijeet Kumar 
wrote:

> Hi Divij,
>
> Seems like there is some confusion about the new protocol for fetching from
> tiered offset.
> The scenario you are highlighting is where,
> Leader's Log Start Offset = 0
> Last Tiered Offset = 10
>
> Following is the sequence of events that will happen:
>
> 1. Follower requests offset 0 from the leader
> 2. Assuming offset 0 is not available locally (to arrive at your scenario),
> Leader returns OffsetMovedToTieredStorageException
> 3. Follower fetches the earliest pending upload offset and receives 11
> 4. Follower builds aux state from [0-10] and sets the fetch offset to 11
> (This step corresponds to step 3 in your email)
>
> At this stage, even if the leader has uploaded more data and the
> last-tiered offset has changed (say to 15), it will not matter
> because offset 11 should still be available on the leader and when the
> follower requests data with fetch offset 11, the leader
> will return with a valid partition data response which the follower can
> consume and proceed further. Even if the offset 11 is not
> available anymore, the follower will eventually be able to catch up with
> the leader by resetting its fetch offset until the offset
> is available on the leader's local log. Once it catches up, replication on
> the follower can proceed.
>
> Regards,
> Abhijeet.
>
>
>
> On Tue, Jul 2, 2024 at 3:55 PM Divij Vaidya 
> wrote:
>
> > Hi folks.
> >
> > I am late to the party but I have a question on the proposal.
> >
> > How are we preventing a situation such as the following:
> >
> > 1. Empty follower asks leader for 0
> > 2. Leader compares 0 with last-tiered-offset, and responds with 11
> (where 10
> > is last-tiered-offset) and an OffsetMovedToTieredException
> > 3. Follower builds aux state from [0-10] and sets the fetch offset to 11
> > 4. But leader has already uploaded more data and now the new
> > last-tiered-offset is 15
> > 5. Go back to 2
> >
> > This could cause a cycle where the replica will be stuck trying to
> > reconcile with the leader.
> >
> > --
> > Divij Vaidya
> >
> >
> >
> > On Fri, Apr 26, 2024 at 7:28 AM Abhijeet Kumar <
> abhijeet.cse@gmail.com
> > >
> > wrote:
> >
> > > Thank you all for your comments. As all the comments in the thread are
> > > addressed, I am starting a Vote thread for the KIP. Please have a look.
> > >
> > > Regards.
> > > Abhijeet.
> > >
> > >
> > >
> > > On Thu, Apr 25, 2024 at 6:08 PM Luke Chen  wrote:
> > >
> > > > Hi, Abhijeet,
> > > >
> > > > Thanks for the update.
> > > >
> > > > I have no more comments.
> > > >
> > > > Luke
> > > >
> > > > On Thu, Apr 25, 2024 at 4:21 AM Jun Rao 
> > > wrote:
> > > >
> > > > > Hi, Abhijeet,
> > > > >
> > > > > Thanks for the updated KIP. It looks good to me.
> > > > >
> > > > > Jun
> > > > >
> > > > > On Mon, Apr 22, 2024 at 12:08 PM Abhijeet Kumar <
> > > > > abhijeet.cse@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Jun,
> > > > > >
> > > > > > Please find my comments inline.
> > > > > >
> > > > > >
> > > > > > On Thu, Apr 18, 2024 at 3:26 AM Jun Rao  >
> > > > > wrote:
> > > > > >
> > > > > > > Hi, Abhijeet,
> > > > > > >
> > > > > > > Thanks for the reply.
> > > > > > >
> > > > > > > 1. I am wondering if we could achieve the same result by just
> > > > lowering
> > > > > > > local.retention.ms and local.retention.bytes. This also allows
> > the
> > > > > newly
> > > > > > > started follower to build up the local data before serving the
> > > >

Jenkins build is still unstable: Kafka » Kafka Branch Builder » trunk #3133

2024-07-23 Thread Apache Jenkins Server
See 




[jira] [Resolved] (KAFKA-17149) Move ProducerStateManagerTest to storage module

2024-07-23 Thread Chia-Ping Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-17149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chia-Ping Tsai resolved KAFKA-17149.

Fix Version/s: 3.9.0
   Resolution: Fixed

> Move ProducerStateManagerTest to storage module
> ---
>
> Key: KAFKA-17149
> URL: https://issues.apache.org/jira/browse/KAFKA-17149
> Project: Kafka
>  Issue Type: Test
>Reporter: Chia-Ping Tsai
>Assignee: 黃竣陽
>Priority: Major
> Fix For: 3.9.0
>
>
> ProducerStateManagerTest's related code is in the storage module :smile:



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] 3.8.0 RC3

2024-07-23 Thread Josep Prat
Here is the link to the system tests:
https://confluent-open-source-kafka-system-test-results.s3-us-west-2.amazonaws.com/3.8/2024-07-22--001.ffbb03b2-61f4-4ebb-ae1f-af5c753682fb--1721733000--confluentinc--3.8--9a2b34b68c/report.html

The Quota tests are known to fail in this CI system. Regarding the other
tests, they ran successfully in the past but are now timing out.

Best,

On Tue, Jul 23, 2024 at 12:07 PM Josep Prat  wrote:

> Hello Kafka users, developers and client-developers,
>
> This is the fourth candidate for release of Apache Kafka 3.8.0.
>
> Some of the major features included in this release are:
> * KIP-1028: Docker Official Image for Apache Kafka
> * KIP-974: Docker Image for GraalVM based Native Kafka Broker
> * KIP-1036: Extend RecordDeserializationException exception
> * KIP-1019: Expose method to determine Metric Measurability
> * KIP-1004: Enforce tasks.max property in Kafka Connect
> * KIP-989: Improved StateStore Iterator metrics for detecting leaks
> * KIP-993: Allow restricting files accessed by File and Directory
> ConfigProviders
> * KIP-924: customizable task assignment for Streams
> * KIP-813: Shareable State Stores
> * KIP-719: Deprecate Log4J Appender
> * KIP-390: Support Compression Level
> * KIP-1018: Introduce max remote fetch timeout config for
> DelayedRemoteFetch requests
> * KIP-1037: Allow WriteTxnMarkers API with Alter Cluster Permission
> * KIP-1047 Introduce new org.apache.kafka.tools.api.Decoder to replace
> kafka.serializer.Decoder
> * KIP-899: Allow producer and consumer clients to rebootstrap
>
> Release notes for the 3.8.0 release:
> https://home.apache.org/~jlprat/kafka-3.8.0-rc3/RELEASE_NOTES.html
>
> Please download, test and vote by Friday, July 26, 9am PT
>
>
> Kafka's KEYS file containing PGP keys we use to sign the release:
> https://kafka.apache.org/KEYS
>
> * Release artifacts to be voted upon (source and binary):
> https://home.apache.org/~jlprat/kafka-3.8.0-rc3/
>
> * Docker release artifact to be voted upon:
> apache/kafka:3.8.0-rc3
> apache/kafka-native:3.8.0-rc3
>
> * Maven artifacts to be voted upon:
> https://repository.apache.org/content/groups/staging/org/apache/kafka/
>
> * Javadoc:
> https://home.apache.org/~jlprat/kafka-3.8.0-rc3/javadoc/
>
> * Tag to be voted upon (off 3.8 branch) is the 3.8.0 tag:
> https://github.com/apache/kafka/releases/tag/3.8.0-rc3
>
> * Documentation:
> https://kafka.apache.org/38/documentation.html
>
> * Protocol:
> https://kafka.apache.org/38/protocol.html
>
> * Successful Jenkins builds for the 3.8 branch:
> Unit/integration tests:
> https://ci-builds.apache.org/job/Kafka/job/kafka/job/3.8/73/
> System tests: Still running
>
> * Successful Docker Image Github Actions Pipeline for 3.8 branch:
> Docker Build Test Pipeline (JVM):
> https://github.com/apache/kafka/actions/runs/10055827182
> Docker Build Test Pipeline (Native):
> https://github.com/apache/kafka/actions/runs/10055829295
>
>
> Thanks,
>
> --
> *Josep Prat*
> Open Source Engineering Director, *Aiven*
> josep.p...@aiven.io   |   +491715557497
> aiven.io
> *Aiven Deutschland GmbH*
> Alexanderufer 3-7, 10117 Berlin
> Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
> Amtsgericht Charlottenburg, HRB 209739 B
>


-- 
*Josep Prat*
Open Source Engineering Director, *Aiven*
josep.p...@aiven.io   |   +491715557497
aiven.io
*Aiven Deutschland GmbH*
Alexanderufer 3-7, 10117 Berlin
Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
Amtsgericht Charlottenburg, HRB 209739 B


Re: [DISCUSS] KIP-512: make Record Headers available in onAcknowledgement

2024-07-23 Thread Kevin Lam
Hi,

Thanks for starting the discussion. Latency Measurement and Tracing
Completeness are both good reasons to support this feature, and would be
interested to see this move forward.
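
To illustrate the tracing/latency use case, here is a sketch of an
interceptor under the proposed change. The three-argument onAcknowledgement
overload below does not exist today; it stands in for whatever shape the KIP
finally takes, and the header name is made up.

import java.util.Map;
import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.header.Headers;

public class LatencyInterceptor implements ProducerInterceptor<String, String> {
    @Override
    public ProducerRecord<String, String> onSend(ProducerRecord<String, String> record) {
        // Stamp the send time into a header so it survives to acknowledgement.
        record.headers().add("send-ts", Long.toString(System.currentTimeMillis()).getBytes());
        return record;
    }

    // Proposed overload (not in the current API): headers from the original
    // record become visible here, enabling per-record latency measurement.
    public void onAcknowledgement(RecordMetadata metadata, Exception exception, Headers headers) {
        long sentAt = Long.parseLong(new String(headers.lastHeader("send-ts").value()));
        long latencyMs = System.currentTimeMillis() - sentAt;
        // report latencyMs to a metrics or tracing backend
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) { }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}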

On Mon, Jul 22, 2024 at 11:15 PM Rich C.  wrote:

> Hi Everyone,
>
> I hope this email finds you well.
>
> I would like to start a discussion on KIP-512. The initial version of
> KIP-512 was created in 2019, and I have resurrected it in 2024 with more
> details about the motivation behind it.
>
> You can view the current version of the KIP here: KIP-512: Make Record
> Headers Available in onAcknowledgement.
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-512%3A+make+Record+Headers+available+in+onAcknowledgement
> >
>
> Let's focus on discussing the necessity of this feature first. If we agree
> on its importance, we can then move on to discussing the proposed changes.
>
> Looking forward to your feedback.
>
> Best regards,
> Rich
>


Re: [DISCUSS] KIP-1071: Streams Rebalance Protocol

2024-07-23 Thread Bruno Cadonna

Hi Lianet,

Thanks for the review!

Here are my answers:

LM1. Is your question whether we need to send a full heartbeat each time 
the member re-joins the group even if the information in the RPC did not 
change since the last heartbeat?


LM2. Is the reason for sending the instance ID each time that a member 
could shutdown, change the instance ID and then start and heartbeat 
again, but the group coordinator would never notice that the instance ID 
changed?


LM3. I see your point. I am wondering whether this additional 
information is worth the dependency between the group types. To return 
INVALID_GROUP_TYPE, the group coordinator needs to know that a group ID 
exists with a different group type. With a group coordinator as we have 
it now in Apache Kafka that manages all group types, that is not a big 
deal, but imagine if we (or some implementation of the Apache Kafka 
protocol) decide to have a separate group coordinator for each group type.


LM4. Using INVALID_GROUP_ID if the group ID is empty makes sense to me. 
I am going to change that.


LM5. I think there is a dependency from the StreamsGroupInitialize RPC 
to the heartbeat. The group must exist when the initialize RPC is 
received by the group coordinator. The group is created by the heartbeat 
RPC. I would be in favor of making the initialize RPC independent from 
the heartbeat RPC. That would allow to initialize a streams group 
explicitly with an admin tool.


LM6. I think it affects streams and streams should behave as the 
consumer group.


LM7. Good point that we will consider.

LM8. Fixed! Thanks!


Best,
Bruno




On 7/19/24 9:53 PM, Lianet M. wrote:

Hi Lucas/Bruno, thanks for the great KIP! First comments:

LM1. Related to where the KIP says:  *“Group ID, member ID, member epoch
are sent with each heartbeat request. Any other information that has not
changed since the last heartbeat can be omitted.”. *I expect all the other
info also needs to be sent whenever a full heartbeat is required (even if
it didn’t change from the last heartbeat), ex. on fencing scenarios,
correct?

LM2. For consumer groups we always send the groupInstanceId (if any) as
part of every heartbeat, along with memberId, epoch and groupId. Should we
consider that too here?

LM3. We’re proposing returning a GROUP_ID_NOT_FOUND error in response to
the stream-specific RPCs if the groupId is associated with a group type
that is not streams (ie. consumer group or share group). I wonder if at
this point, where we're getting several new group types added, each with
RPCs that are supposed to include groupId of a certain type, we should be
more explicit about this situation. Maybe a kind of INVALID_GROUP_TYPE
(group exists but not with a valid type for this RPC) vs a
GROUP_ID_NOT_FOUND (group does not exist).  Those errors would be
consistently used across consumer, share, and streams RPCs whenever the
group id is not of the expected type.
This is truly not specific to this KIP, and should be addressed with all
group types and their RPCs in mind. I just wanted to bring out my concern
and get thoughts around it.

LM4. On a related note, StreamsGroupDescribe returns INVALID_REQUEST if
groupId is empty. There is already an INVALID_GROUP_ID error, that seems
more specific to this situation. Error handling of specific errors would
definitely be easier than having to deal with a generic INVALID_REQUEST
(and probably its custom message). I know that for KIP-848 we have
INVALID_REQUEST for similar situations, so if ever we take down this path
we should review it there too for consistency. Thoughts?

LM5. The dependency between the StreamsGroupHeartbeat RPC and the
StreamsGroupInitialize RPC is one-way only right? HB requires a previous
StreamsGroupInitialize request, but StreamsGroupInitialize processing is
totally independent of heartbeats (and could perfectly be processed without
a previous HB, even though the client implementation we’re proposing won’t
go down that path). Is my understanding correct? Just to double check,
seems sensible like that at the protocol level.

LM6. With KIP-848, there is an important improvement that brings a
difference in behaviour around the static membership: with the classic
protocol, if a static member joins with a group instance already in use, it
makes the initial member fail with a FENCED_INSTANCED_ID exception, vs.
with the new consumer group protocol, the second member trying to join
fails with an UNRELEASED_INSTANCE_ID. Does this change need to be
considered in any way for the streams app? (I'm not familiar with KS yet,
but thought it was worth asking. If it doesn't affect in any way, still
maybe helpful to call it out on a section for static membership)

LM7. Regarding the admin tool to manage streams groups. We can discuss
whether to have it here or separately, but I think we should aim for some
basic admin capabilities from the start, mainly because I believe it will
be very helpful/needed in practice during the impl of the KIP. From
experie

[jira] [Created] (KAFKA-17188) LoginManager ctor might throw an exception causing login and loginCallbackHandler not being closed properly

2024-07-23 Thread Philip Nee (Jira)
Philip Nee created KAFKA-17188:
--

 Summary: LoginManager ctor might throw an exception causing login 
and loginCallbackHandler not being closed properly
 Key: KAFKA-17188
 URL: https://issues.apache.org/jira/browse/KAFKA-17188
 Project: Kafka
  Issue Type: Bug
  Components: security
Reporter: Philip Nee
Assignee: Philip Nee


When using MDS login, loginManager.login() may throw an exception, causing login 
and loginCallbackHandler not to be closed properly. We should catch the exception 
and close login and loginCallbackHandler.
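
A sketch of the fix (field names approximate LoginManager's internals rather
than quoting them exactly):

{code:java}
try {
    login.login();
} catch (Exception e) {
    // Construction failed after the resources were created: release them
    // before rethrowing so neither the login nor the callback handler leaks.
    loginCallbackHandler.close();
    login.close();
    throw e;
}
{code}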



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-1068: KIP-1068: New JMX Metrics for AsyncKafkaConsumer

2024-07-23 Thread Bruno Cadonna

Hi Brenden,

BC1. In his first e-mail Andrew wrote "I would expect that the metrics 
do not exist at all". I agree with him. I think it would be better to 
not add those metrics at all if the CLASSIC protocol is used rather than 
the metrics exist and are all constant 0. This should be possible by not 
adding the metrics to the metrics registry if the CONSUMER protocol is 
not used.
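
Something along these lines should do it. This is a sketch, not the actual
consumer code; the queue field is a stand-in and the metric shown is just one
example:

// Register the async-consumer metrics only when the CONSUMER protocol is in
// use, so a CLASSIC-protocol consumer never exposes permanently-zero metrics.
if (groupProtocol == GroupProtocol.CONSUMER) {
    metrics.addMetric(
            metrics.metricName("application-event-queue-size", "consumer-metrics",
                    "The current number of events in the application event queue"),
            (Gauge<Integer>) (config, now) -> applicationEventQueue.size());
}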


BC2. Is there a specific reason you do not propose 
background-event-queue-time-max and background-event-queue-time-avg? If 
folk think those are not useful we do not need to add them. However, if 
those are not useful, why is background-event-queue-size useful? I was 
just wondering about the asymmetry between background-event-queue and 
application-event-queue.


Best,
Bruno



On 7/19/24 9:14 PM, Brenden Deluna wrote:

Hi Apoorv,
Thank you for your comments, I will address each.

AM1. I can see the usefulness in also having an
'application-event-queue-age-max' to get an idea of outliers and how they
may be affecting the average metric. I will add that.

AM2. I agree with you there, I think 'time' is a better descriptor here
than 'age'. I will update those metric names as well.

AM3. Similar to above comments, I will change the name of that metric to be
more consistent. And I think a max metric would also be useful here, adding
that.

AM4. Yes, good catch there. Will update that as well.

Thank you,
Brenden

On Fri, Jul 19, 2024 at 8:14 AM Apoorv Mittal 
wrote:


Hi Brendan,
Thanks for the KIP. The metrics are always helpful.

AM1: Is `application-event-queue-age-avg` enough or do we require `
application-event-queue-age-max` as well to differentiate with outliers?

AM2: The kafka producer defines metric `record-queue-time-avg` which
captures the time spent in the buffer. Do you think we should have a
similar name for `application-event-queue-age-avg` i.e. change to `
application-event-queue-time-avg`? Moreover other than similar naming,
`time` anyways seems more suitable than `age`, though minor. The `time`
usage is also aligned with the description of this metric.

AM3: Metric `application-event-processing-time` says "the average time,
that the consumer network.". Shall we have the `-avg` suffix in the
metric as we have defined for other metrics? Also do we require the max
metric as well for the same?

AM4: Is the telemetry name for `unsent-requests-queue-size` intended
as `org.apache.kafka.consumer.unsent.requests.size`,
or it should be corrected to `
org.apache.kafka.consumer.unsent.requests.queue.size`?

Regards,
Apoorv Mittal
+44 7721681581


On Mon, Jul 15, 2024 at 2:45 PM Andrew Schofield <
andrew_schofi...@live.com>
wrote:


Hi Brenden,
Thanks for the updates.

AS4. I see that you’ve added `.ms` to a bunch of the metrics reflecting the
fact that they’re measured in milliseconds. However, I observe that most
metrics in Kafka that are measured in milliseconds, with some exceptions in
Kafka Connect and MirrorMaker, do not follow this convention. I would tend
to err on the side of consistency with the existing metrics and not use
`.ms`. However, that’s just my opinion, so I’d be interested to know what
other reviewers of the KIP think.

Thanks,
Andrew


On 12 Jul 2024, at 20:11, Brenden Deluna 

wrote:


Hey Lianet,

Thank you for your suggestions and feedback!


LM1. This has now been addressed.


LM2. I think that would be a valuable addition to the current set of
metrics, I will get that added.


LM3. Again great idea, that would certainly be helpful. Will add that as
well.


Let me know if you have any more suggestions!


Thanks,

Brenden

On Fri, Jul 12, 2024 at 2:11 PM Brenden Deluna 

wrote:



Hi Lucas,

Thank you for the feedback! I have addressed your comments:


LB1. Good catch there, I will update the names as needed.


LB2. Good catch again! I will update the name to be more consistent.


LB3. Thank you for pointing this out, I realized that all metric values
will actually be set to 0. I will specify this and explain why they will
be 0.


Nit: This metric is referring to the queue of unsent requests in the
NetworkClientDelegate. For the metric descriptions I am trying to not
include too much of the implementation details, hence the reason that
description is quite short. I cannot think of other ways to describe the
metric without going deeper into the implementation, but please do let me
know if you have any ideas.


Thank you,

Brenden

On Fri, Jul 12, 2024 at 1:27 PM Lianet M.  wrote:


Hey Brenden, thanks for the KIP! Great to get more visibility into the new
consumer.

LM1. +1 on Lucas's suggestion for including the unit in the name, seems
clearer and consistent (I do see several time metrics including ms)

LM2. What about a new metric for application-event-queue-time-ms? It would
be a complement to the application-event-queue-size you're proposing, and
it will tell us how long the events sit in the queue, waiting to be
processed (from the time the API call adds the event

[jira] [Created] (KAFKA-17189) Add Thread::isAlive to DetectThreadLeak#of filter

2024-07-23 Thread PoAn Yang (Jira)
PoAn Yang created KAFKA-17189:
-

 Summary: Add Thread::isAlive to DetectThreadLeak#of filter
 Key: KAFKA-17189
 URL: https://issues.apache.org/jira/browse/KAFKA-17189
 Project: Kafka
  Issue Type: Sub-task
Reporter: PoAn Yang
Assignee: PoAn Yang


Although we have KAFKA-17159 to close kafka-cluster-test-kit-executor, we still 
have many thread leaks when removing `waitForCondition` [0].

 

After adding a log line to show whether the thread is alive when the "Thread 
leak detected" message appears, I found that the detected thread is not alive. 
We may need to consider adding `Thread::isAlive` to the `DetectThreadLeak#of` 
filter.

 

[0] 
[https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-16661/1/testReport/]
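
A sketch of the combined filter (DetectThreadLeak is a test utility, so the
shape below is approximate and the prefix set is illustrative):

{code:java}
// Combine the existing name-based filter with Thread::isAlive so threads
// that have already terminated are not reported as leaks.
Predicate<Thread> filter = thread -> thread.isAlive()
        && SKIPPED_THREAD_PREFIXES.stream()
                .noneMatch(prefix -> thread.getName().startsWith(prefix));
DetectThreadLeak detector = DetectThreadLeak.of(filter);
{code}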
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] KIP-939: Support Participation in 2PC

2024-07-23 Thread Artem Livshits
Thanks everyone! I'm going to close the vote.

Martijn Visser +1
Justine Olshan +1 (binding)
Omnia Ibrahim +1
Jason Gustafson +1 (binding)
Jun Rao +1 (binding)

The KIP is accepted.

-Artem

On Mon, Jul 22, 2024 at 10:57 AM Jun Rao  wrote:

> Hi, Artem,
>
> Thanks for the KIP. +1 from me.
>
> Jun
>
> On Wed, May 29, 2024 at 9:41 AM Jason Gustafson  >
> wrote:
>
> > +1 Thanks for the KIP!
> >
> > On Wed, Mar 13, 2024 at 5:13 AM Omnia Ibrahim 
> > wrote:
> >
> > > I had a look at the discussion thread and the KIP looks exciting.
> > > +1 non-binding
> > >
> > > Best
> > > Omnia
> > >
> > > On 1 Dec 2023, at 19:06, Artem Livshits  .INVALID>
> > > wrote:
> > >
> > > Hello,
> > >
> > > This is a voting thread for
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-939%3A+Support+Participation+in+2PC
> > > .
> > >
> > > The KIP proposes extending Kafka transaction support (that already uses
> > 2PC
> > > under the hood) to enable atomicity of dual writes to Kafka and an
> > external
> > > database, and helps to fix a long standing Flink issue.
> > >
> > > An example of code that uses the dual write recipe with JDBC and should
> > > work for most SQL databases is here
> > > https://github.com/apache/kafka/pull/14231.
> > >
> > > The FLIP for the sister fix in Flink is here
> > >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=255071710
> > >
> > > -Artem
> > >
> >
>


Re: [DISCUSS] KIP-1043: Administration of groups

2024-07-23 Thread Lianet M.
Hello Andrew,

Bringing here the point I surfaced on the KIP-1071 thread:

I wonder if at this point, where we're getting several new group types
> added, each with RPCs that are supposed to include groupId of a certain
> type, we should be more explicit about this situation. Maybe a kind of
> INVALID_GROUP_TYPE (group exists but not with a valid type for this RPC) vs
> a GROUP_ID_NOT_FOUND (group does not exist).  Those errors would be
> consistently used across consumer, share, and streams RPCs whenever the
> group id is not of the expected type.


I noticed it on KIP-1071 but totally agree with you that it would make more
sense to consider it here.

LM9. Regarding the point of introducing a new INVALID_GROUP_TYPE vs reusing
the existing INCONSISTENT_PROTOCOL_TYPE. My concern with reusing
INCONSISTENT_GROUP_PROTOCOL for errors with the group ID is that it mixes
the concepts of group type and protocol. Even though they are closely
related, we have 2 separate concepts (internally and presented in output
for commands), and the relationship is not 1-1 in all cases. Also, the
INCONSISTENT_GROUP_PROTOCOL is already used not only for protocol but also
when validating the list of assignors provided by a consumer in a
JoinGroupRequest. Seems a bit confusing to me already, so maybe better not
to add more to it? Just first thoughts. What do you think?

Thanks,
Lianet

On Fri, Jul 19, 2024 at 5:00 AM Andrew Schofield 
wrote:

> Hi Apoorv,
> Thanks for your comments.
>
> AM1: I chose to leave the majority of the administration for the different
> types of groups in their own tools. The differences between the group
> types are significant and I think that one uber tool that subsumes
> kafka-consumer-groups.sh, kafka-share-groups.sh and
> kafka-streams-application-reset.sh would be too overwhelming and
> difficult to use. For example, the output from describing a consumer group
> is not the same as the output from describing a share group.
>
> AM2: I think you’re highlighting some of the effects of the evolution
> of groups. The classic consumer group protocol defined the idea
> of protocol as a way of distinguishing between the various ways people
> had extended the base protocol - “consumer", “connect", and “sr" are the
> main ones I’ve seen, and the special “” for groups that are not using
> member assignment.
>
> For the modern group protocol, each of the proposed implementations
> brings its own use of the protocol string - “consumer”, “share” and
> “streams”.
>
> Now, prior to AK 4.0, in order to make the console consumer use the
> new group protocol, you set `--consumer-property group.protocol=consumer`.
> This tells a factory method in the consumer to use the AsyncKafkaConsumer
> (group type is Consumer, protocol is “consumer") as opposed to the
> LegacyKafkaConsumer (group type is Classic, protocol is “consumer”).
> In AK 4.0, the default group protocol will change and setting the property
> will not be necessary. The name of the configuration “group.protocol”
> is slightly misleading. In practice, this is most likely to be used pre-AK
> 4.0
> by people wanting to try out the new consumer.
>
> AM3: When you try to create a share group and that group ID is already
> in use by another type of group, the error message is “Group CG1 is not
> a share group”. It exists already, with the wrong type.
>
> AM4: This KIP changes the error behaviour for `kafka-consumer-groups.sh`
> and `kafka-share-groups.sh` such that any operation on a group that finds
> the
> group type is incorrect reports “Error: Group XXX is not a consumer group”
> or
> equivalent for the other group types. This change makes things much easier
> to
> understand than they are today.
>
> AM5: That section is just clarifying what the behaviour is. I don’t think
> it had
> been written down before.
>
> Thanks,
> Andrew
>
> > On 18 Jul 2024, at 16:43, Apoorv Mittal 
> wrote:
> >
> > Hi Andrew,
> > Thanks for the KIP. The group administration is getting difficult with
> new
> > types of groups being added and certainly the proposal looks great.
> > I have some questions:
> >
> > AM1: The current proposal defines the behaviour for listing and
> describing
> > groups, simplifying create for `kafka-share-groups.sh`. Do we want to
> leave
> > the other group administration like delete to respective groups or shall
> > have common behaviour defined i.e. leave to respective
> > kafka-consumer-groups.sh or kafka-share-groups.sh?
> >
> > AM2: Reading the notes on KIP-848,
> >
> https://cwiki.apache.org/confluence/display/KAFKA/The+Next+Generation+of+the+Consumer+Rebalance+Protocol+%28KIP-848%29+-+Early+Access+Release+Notes
> ,
> > which requires `--consumer-property group.protocol=consumer` to enable
> > modern consumer group. But the listing for `classic` "type" also defines
> > "protocol" as `consumer` in some scenarios. Is it intended or `classic`
> > type should different protocol?
> >
> > AM3: The KIP adds behaviour for  `kafka-share-groups.sh` which define

Re: [DISCUSS] KIP-1061: Allow exporting SCRAM credentials

2024-07-23 Thread kafka
Hi Everyone,

Bumping this thread to get some feedback on the proposal.

If there are no concerns then I'll request a vote in a couple of weeks.

Regards,
Gaurav

> On 24 Jun 2024, at 09:48, ka...@gnarula.com wrote:
> 
> Hi all,
> 
> I'd like to kick off discussion for KIP-1061 which proposes a way to export 
> SCRAM credentials:
> 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1061%3A+Allow+exporting+SCRAM+credentials
> 
> Please have a look. Looking forward to hearing your thoughts!
> 
> Regards,
> Gaurav



[jira] [Resolved] (KAFKA-16068) Use TestPlugins in ConnectorValidationIntegrationTest to silence plugin scanning errors

2024-07-23 Thread Chris Egerton (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Egerton resolved KAFKA-16068.
---
Fix Version/s: 3.9.0
   Resolution: Fixed

> Use TestPlugins in ConnectorValidationIntegrationTest to silence plugin 
> scanning errors
> ---
>
> Key: KAFKA-16068
> URL: https://issues.apache.org/jira/browse/KAFKA-16068
> Project: Kafka
>  Issue Type: Task
>  Components: connect
>Reporter: Greg Harris
>Assignee: Chris Egerton
>Priority: Minor
>  Labels: newbie++
> Fix For: 3.9.0
>
>
> The ConnectorValidationIntegrationTest creates test plugins, some with 
> erroneous behavior. In particular:
>  
> {noformat}
> [2023-12-29 10:28:06,548] ERROR Failed to discover Converter in classpath: 
> Unable to instantiate TestConverterWithPrivateConstructor: Plugin class 
> default constructor must be public 
> (org.apache.kafka.connect.runtime.isolation.ReflectionScanner:138) 
> [2023-12-29 10:28:06,550]
> ERROR Failed to discover Converter in classpath: Unable to instantiate 
> TestConverterWithConstructorThatThrowsException: Failed to invoke plugin 
> constructor (org.apache.kafka.connect.runtime.isolation.ReflectionScanner:138)
> java.lang.reflect.InvocationTargetException{noformat}
> These plugins should be eliminated from the classpath, so that the errors do 
> not appear in unrelated tests. Instead, plugins with erroneous behavior 
> should only be present in the TestPlugins, so that tests can opt-in to 
> loading them.
> There are already plugins with private constructors and 
> throwing-exceptions-constructors, so they should be able to be re-used.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Fwd: [VOTE] 3.8.0 RC3

2024-07-23 Thread Chris Egerton
Forwarding my response to the other mailing list threads; apologies for
missing the reply-all the first time!

-- Forwarded message -
From: Chris Egerton 
Date: Tue, Jul 23, 2024 at 3:45 PM
Subject: Re: [VOTE] 3.8.0 RC3
To: 


Hi Josep,

Thanks for running this release! I'm +1 (binding).


To verify, I:
- Built from source using Java 11 with both:
- - the 3.8.0-rc3 tag on GitHub
- - the source artifact from
https://home.apache.org/~jlprat/kafka-3.8.0-rc3/kafka-3.8.0-src.tgz
- Checked signatures and checksums
- Ran the quickstart using both:
- - The build artifact from
https://home.apache.org/~jlprat/kafka-3.8.0-rc3/kafka_2.13-3.8.0.tgz with
Java 11 and Scala 13 in KRaft mode
- - Our JVM-based broker Docker image, apache/kafka:3.8.0-rc3
- Ran all unit tests
- Ran all integration tests for Connect and MM2


A few small, non-blocking notes:
1) The release notes categorize KAFKA-16445 [1] as an improvement, but I
believe it should be listed as a new feature instead.
2) The following unit tests failed the first time around, but passed when
run a second time:
- (clients) SaslAuthenticatorTest.testMissingUsernameSaslPlain()
- (core) ProducerIdManagerTest.testUnrecoverableErrors(UNKNOWN_SERVER_ERROR)
- (core) RemoteLogManagerTest.testCopyQuota(false)
-
(core) 
SocketServerTest.testClientDisconnectionWithOutstandingReceivesProcessedUntilFailedSend()
- (core) ZkMigrationIntegrationTest.testMigrateTopicDeletions [7] Type=ZK,
MetadataVersion=3.7-IV4,Security=PLAINTEXT
  - This is also not actually a unit test, but an integration test. Looks
like we haven't classified it correctly?


[1] - https://issues.apache.org/jira/browse/KAFKA-16445

Cheers,

Chris

On Tue, Jul 23, 2024 at 8:42 AM Josep Prat 
wrote:

> Here is the link to the system tests:
>
> https://confluent-open-source-kafka-system-test-results.s3-us-west-2.amazonaws.com/3.8/2024-07-22--001.ffbb03b2-61f4-4ebb-ae1f-af5c753682fb--1721733000--confluentinc--3.8--9a2b34b68c/report.html
>
> The Quota tests are known to fail in this CI system. Regarding the other
> tests, they ran successfully in the past but are now timing out.
>
> Best,
>
> On Tue, Jul 23, 2024 at 12:07 PM Josep Prat  wrote:
>
> > Hello Kafka users, developers and client-developers,
> >
> > This is the fourth candidate for release of Apache Kafka 3.8.0.
> >
> > Some of the major features included in this release are:
> > * KIP-1028: Docker Official Image for Apache Kafka
> > * KIP-974: Docker Image for GraalVM based Native Kafka Broker
> > * KIP-1036: Extend RecordDeserializationException exception
> > * KIP-1019: Expose method to determine Metric Measurability
> > * KIP-1004: Enforce tasks.max property in Kafka Connect
> > * KIP-989: Improved StateStore Iterator metrics for detecting leaks
> > * KIP-993: Allow restricting files accessed by File and Directory
> > ConfigProviders
> > * KIP-924: customizable task assignment for Streams
> > * KIP-813: Shareable State Stores
> > * KIP-719: Deprecate Log4J Appender
> > * KIP-390: Support Compression Level
> > * KIP-1018: Introduce max remote fetch timeout config for
> > DelayedRemoteFetch requests
> > * KIP-1037: Allow WriteTxnMarkers API with Alter Cluster Permission
> > * KIP-1047: Introduce new org.apache.kafka.tools.api.Decoder to replace
> > kafka.serializer.Decoder
> > * KIP-899: Allow producer and consumer clients to rebootstrap
> >
> > Release notes for the 3.8.0 release:
> > https://home.apache.org/~jlprat/kafka-3.8.0-rc3/RELEASE_NOTES.html
> >
> > Please download, test and vote by Friday, July 26, 9am PT.
> >
> >
> > Kafka's KEYS file containing PGP keys we use to sign the release:
> > https://kafka.apache.org/KEYS
> >
> > * Release artifacts to be voted upon (source and binary):
> > https://home.apache.org/~jlprat/kafka-3.8.0-rc3/
> >
> > * Docker release artifact to be voted upon:
> > apache/kafka:3.8.0-rc3
> > apache/kafka-native:3.8.0-rc3
> >
> > * Maven artifacts to be voted upon:
> > https://repository.apache.org/content/groups/staging/org/apache/kafka/
> >
> > * Javadoc:
> > https://home.apache.org/~jlprat/kafka-3.8.0-rc3/javadoc/
> >
> > * Tag to be voted upon (off 3.8 branch) is the 3.8.0 tag:
> > https://github.com/apache/kafka/releases/tag/3.8.0-rc3
> >
> > * Documentation:
> > https://kafka.apache.org/38/documentation.html
> >
> > * Protocol:
> > https://kafka.apache.org/38/protocol.html
> >
> > * Successful Jenkins builds for the 3.8 branch:
> > Unit/integration tests:
> > https://ci-builds.apache.org/job/Kafka/job/kafka/job/3.8/73/
> > System tests: Still running
> >
> > * Successful Docker Image Github Actions Pipeline for 3.8 branch:
> > Docker Build Test Pipeline (JVM):
> > https://github.com/apache/kafka/actions/runs/10055827182
> > Docker Build Test Pipeline (Native):
> > https://github.com/apache/kafka/actions/runs/10055829295
> >
> >
> > Thanks,
> >
> > --
> >
> > *Josep Prat*
> > Open Source Engineering Director, *Aiven*
> > josep.p...@aiven.io   |   +491

Re: [DISCUSS] KIP-1068: New JMX Metrics for AsyncKafkaConsumer

2024-07-23 Thread Brenden Deluna
Hi Bruno,
Thank you for your comments.

BC1. I think that is a great point; I was not aware that we could skip
registering the metrics based on which protocol is being used. I will update
the KIP to reflect that change.
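
To make BC1 concrete, here is a minimal sketch of conditional registration,
assuming a boolean flag for the active group protocol and the common Metrics
registry; the metric and group names are illustrative, not the final KIP names:

{code:java}
import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Avg;
import org.apache.kafka.common.metrics.stats.Max;

public final class AsyncConsumerQueueMetrics {

    // Register the new queue metrics only when the CONSUMER protocol is in
    // use; under CLASSIC they are never added to the registry, so they cannot
    // show up as constant 0.
    public static void maybeRegister(Metrics metrics, boolean consumerProtocol) {
        if (!consumerProtocol) {
            return;
        }
        Sensor queueTime = metrics.sensor("application-event-queue-time");
        queueTime.add(metrics.metricName("application-event-queue-time-avg",
                "consumer-metrics",
                "Average time an application event spends in the queue"), new Avg());
        queueTime.add(metrics.metricName("application-event-queue-time-max",
                "consumer-metrics",
                "Maximum time an application event spends in the queue"), new Max());
    }
}
{code}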

BC2. There is no specific reason for this; it has just never been suggested
in this thread. I will add them to gather opinions on whether they would be
useful and will go with the group consensus.

Thank you,
Brenden

On Tue, Jul 23, 2024 at 11:03 AM Bruno Cadonna  wrote:

> Hi Brenden,
>
> BC1. In his first e-mail Andrew wrote "I would expect that the metrics
> do not exist at all". I agree with him. I think it would be better to
> not add those metrics at all if the CLASSIC protocol is used rather than
> the metrics exist and are all constant 0. This should be possible by not
> adding the metrics to the metrics registry if the CONSUMER protocol is
> not used.
>
> BC2. Is there a specific reason you do not propose
> background-event-queue-time-max and background-event-queue-time-avg? If
> folk think those are not useful we do not need to add them. However, if
> those are not useful, why is background-event-queue-size useful. I was
> just wondering about the asymmetry between background-event-queue and
> application-event-queue.
>
> Best,
> Bruno
>
>
>
> On 7/19/24 9:14 PM, Brenden Deluna wrote:
> > Hi Apoorv,
> > Thank you for your comments, I will address each.
> >
> > AM1. I can see the usefulness in also having an
> > 'application-event-queue-age-max' to get an idea of outliers and how they
> > may be affecting the average metric. I will add that.
> >
> > AM2. I agree with you there, I think 'time' is a better descriptor here
> > than 'age'. I will update those metric names as well.
> >
> > AM3. Similar to above comments, I will change the name of that metric to be
> > more consistent. And I think a max metric would also be useful here, adding
> > that.
> >
> > AM4. Yes, good catch there. Will update that as well.
> >
> > Thank you,
> > Brenden
> >
> > On Fri, Jul 19, 2024 at 8:14 AM Apoorv Mittal 
> > wrote:
> >
> >> Hi Brenden,
> >> Thanks for the KIP. The metrics are always helpful.
> >>
> >> AM1: Is `application-event-queue-age-avg` enough or do we require `
> >> application-event-queue-age-max` as well to distinguish outliers?
> >>
> >> AM2: The kafka producer defines metric `record-queue-time-avg` which
> >> captures the time spent in the buffer. Do you think we should have a
> >> similar name for `application-event-queue-age-avg` i.e. change to `
> >> application-event-queue-time-avg`? Moreover, beyond the naming consistency,
> >> `time` seems more suitable than `age` anyway, though this is minor. The `time`
> >> usage is also aligned with the description of this metric.
> >>
> >> AM3: Metric `application-event-processing-time` says "the average time,
> >> that the consumer network.". Shall we add the `-avg` suffix to the
> >> metric, as we have defined for other metrics? Also, do we require a
> >> corresponding max metric?
> >>
> >> AM4: Is the telemetry name for `unsent-requests-queue-size` intended
> >> as `org.apache.kafka.consumer.unsent.requests.size`,
> >> or should it be corrected to `
> >> org.apache.kafka.consumer.unsent.requests.queue.size`?
> >>
> >> Regards,
> >> Apoorv Mittal
> >> +44 7721681581
> >>
> >>
> >> On Mon, Jul 15, 2024 at 2:45 PM Andrew Schofield <
> >> andrew_schofi...@live.com>
> >> wrote:
> >>
> >>> Hi Brenden,
> >>> Thanks for the updates.
> >>>
> >>> AS4. I see that you’ve added `.ms` to a bunch of the metrics reflecting the
> >>> fact that they’re measured in milliseconds. However, I observe that most
> >>> metrics in Kafka that are measured in milliseconds, with some exceptions in
> >>> Kafka Connect and MirrorMaker, do not follow this convention. I would tend
> >>> to err on the side of consistency with the existing metrics and not use
> >>> `.ms`. However, that’s just my opinion, so I’d be interested to know what
> >>> other reviewers of the KIP think.
> >>>
> >>> Thanks,
> >>> Andrew
> >>>
>  On 12 Jul 2024, at 20:11, Brenden Deluna wrote:
> 
>  Hey Lianet,
> 
>  Thank you for your suggestions and feedback!
> 
> 
>  LM1. This has now been addressed.
> 
> 
>  LM2. I think that would be a valuable addition to the current set of
>  metrics, I will get that added.
> 
> 
>  LM3. Again great idea, that would certainly be helpful. Will add that
> >> as
>  well.
> 
> 
>  Let me know if you have any more suggestions!
> 
> 
>  Thanks,
> 
>  Brenden
> 
>  On Fri, Jul 12, 2024 at 2:11 PM Brenden Deluna wrote:
> 
> > Hi Lucas,
> >
> > Thank you for the feedback! I have addressed your comments:
> >
> >
> > LB1. Good catch there, I will update the names as needed.
> >
> >
> > LB2. Good catch again! I will update the name to be more consistent

Re: [DISCUSS] KIP-1068: New JMX Metrics for AsyncKafkaConsumer

2024-07-23 Thread Lianet M.
Hey all,

Follow-up on Bruno's point BC2. I personally did not suggest
background-event-queue-time-max and background-event-queue-time-avg mainly
because in practice we only have 2 types of events flowing from the
background to the app thread: errorEvents and callbackEvents (vs. the many
api-triggered events that flow from the app thread to the background). If
we dig deeper into those, the error events are actually very light (from
the processing point of view), and the callback events already have their
own specific metrics (inherited from the legacy consumer). So that was my
reasoning for not jumping right away into metrics for the background events
queue.

That being said, I do like the symmetry Bruno mentions, and see the value
in having visibility on both queues. The app queue would probably be the
busiest, but we need to consider that the set of background events may grow as
we integrate other features like Queues (e.g., with their ack commit
callbacks), so I am OK with having a view on both queues.

Lianet


On Tue, Jul 23, 2024 at 4:04 PM Brenden Deluna 
wrote:

> Hi Bruno,
> Thank you for your comments.
>
> BC1. I think that is a great point; I was not aware that we could skip
> registering the metrics based on which protocol is being used. I will update
> the KIP to reflect that change.
>
> BC2. There is no specific reason for this; it has just never been suggested
> in this thread. I will add them to gather opinions on whether they would be
> useful and will go with the group consensus.
>
> Thank you,
> Brenden
>
> On Tue, Jul 23, 2024 at 11:03 AM Bruno Cadonna  wrote:
>
> > Hi Brenden,
> >
> > BC1. In his first e-mail Andrew wrote "I would expect that the metrics
> > do not exist at all". I agree with him. I think it would be better to
> > not add those metrics at all if the CLASSIC protocol is used rather than
> > the metrics exist and are all constant 0. This should be possible by not
> > adding the metrics to the metrics registry if the CONSUMER protocol is
> > not used.
> >
> > BC2. Is there a specific reason you do not propose
> > background-event-queue-time-max and background-event-queue-time-avg? If
> > folk think those are not useful we do not need to add them. However, if
> > those are not useful, why is background-event-queue-size useful. I was
> > just wondering about the asymmetry between background-event-queue and
> > application-event-queue.
> >
> > Best,
> > Bruno
> >
> >
> >
> > On 7/19/24 9:14 PM, Brenden Deluna wrote:
> > > Hi Apoorv,
> > > Thank you for your comments, I will address each.
> > >
> > > AM1. I can see the usefulness in also having an
> > > 'application-event-queue-age-max' to get an idea of outliers and how they
> > > may be affecting the average metric. I will add that.
> > >
> > > AM2. I agree with you there, I think 'time' is a better descriptor here
> > > than 'age'. I will update those metric names as well.
> > >
> > > AM3. Similar to above comments, I will change the name of that metric to be
> > > more consistent. And I think a max metric would also be useful here, adding
> > > that.
> > >
> > > AM4. Yes, good catch there. Will update that as well.
> > >
> > > Thank you,
> > > Brenden
> > >
> > > On Fri, Jul 19, 2024 at 8:14 AM Apoorv Mittal <apoorvmitta...@gmail.com> wrote:
> > >
> > >> Hi Brenden,
> > >> Thanks for the KIP. The metrics are always helpful.
> > >>
> > >> AM1: Is `application-event-queue-age-avg` enough or do we require `
> > >> application-event-queue-age-max` as well to distinguish outliers?
> > >>
> > >> AM2: The kafka producer defines metric `record-queue-time-avg` which
> > >> captures the time spent in the buffer. Do you think we should have a
> > >> similar name for `application-event-queue-age-avg` i.e. change to `
> > >> application-event-queue-time-avg`? Moreover, beyond the naming consistency,
> > >> `time` seems more suitable than `age` anyway, though this is minor. The `time`
> > >> usage is also aligned with the description of this metric.
> > >>
> > >> AM3: Metric `application-event-processing-time` says "the average time,
> > >> that the consumer network.". Shall we add the `-avg` suffix to the
> > >> metric, as we have defined for other metrics? Also, do we require a
> > >> corresponding max metric?
> > >>
> > >> AM4: Is the telemetry name for `unsent-requests-queue-size` intended
> > >> as `org.apache.kafka.consumer.unsent.requests.size`,
> > >> or should it be corrected to `
> > >> org.apache.kafka.consumer.unsent.requests.queue.size`?
> > >>
> > >> Regards,
> > >> Apoorv Mittal
> > >> +44 7721681581
> > >>
> > >>
> > >> On Mon, Jul 15, 2024 at 2:45 PM Andrew Schofield <
> > >> andrew_schofi...@live.com>
> > >> wrote:
> > >>
> > >>> Hi Brenden,
> > >>> Thanks for the updates.
> > >>>
> > >>> AS4. I see that you’ve added `.ms` to a bunch of the metrics reflecting the
> > >>> fact that they’re measured in milliseconds. However, I observe that most
> > >>> metrics in Kafka that are measured in milliseconds, with some exceptions
> > >>> in Kafka Connect and MirrorMaker, do not follow this convention.

Re: [VOTE] 3.8.0 RC3

2024-07-23 Thread Greg Harris
Hi Josep,

+1 (binding)

I performed the following validations:
1. I formatted a new log directory and verified that the stray log message
is no longer present (KAFKA-17148)
2. I configured a connect worker with a custom plugin path and
plugin.discovery=service_load and verified that plugin scanning doesn't
throw errors with jackson (KAFKA-17111)
3. I verified that aliases for converters work as-intended (KAFKA-17150)
4. Verified that the allowed.paths configuration works as intended for the
DirectoryConfigProvider (KIP-993)
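
For reference, check 4 boils down to a few lines against the
DirectoryConfigProvider API. The paths below are made up, and the sketch
assumes a disallowed path yields empty data, per KIP-993:

{code:java}
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.common.config.ConfigData;
import org.apache.kafka.common.config.provider.DirectoryConfigProvider;

public class AllowedPathsCheck {
    public static void main(String[] args) {
        DirectoryConfigProvider provider = new DirectoryConfigProvider();

        Map<String, String> conf = new HashMap<>();
        conf.put("allowed.paths", "/etc/kafka/secrets"); // KIP-993 restriction
        provider.configure(conf);

        // Inside the allowed path: one key per file in the directory.
        ConfigData allowed = provider.get("/etc/kafka/secrets");
        System.out.println(allowed.data().keySet());

        // Outside the allowed path: no file contents should be returned.
        ConfigData denied = provider.get("/etc/some-other-dir");
        System.out.println(denied.data().isEmpty());
    }
}
{code}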

Thanks for running the release!
Greg

On Tue, Jul 23, 2024 at 12:51 PM Chris Egerton 
wrote:

> Forwarding my response to the other mailing list threads; apologies for
> missing the reply-all the first time!
>
> -- Forwarded message -
> From: Chris Egerton 
> Date: Tue, Jul 23, 2024 at 3:45 PM
> Subject: Re: [VOTE] 3.8.0 RC3
> To: 
>
>
> Hi Josep,
>
> Thanks for running this release! I'm +1 (binding).
>
>
> To verify, I:
> - Built from source using Java 11 with both:
> - - the 3.8.0-rc3 tag on GitHub
> - - the source artifact from
> https://home.apache.org/~jlprat/kafka-3.8.0-rc3/kafka-3.8.0-src.tgz
> - Checked signatures and checksums
> - Ran the quickstart using both:
> - - The build artifact from
> https://home.apache.org/~jlprat/kafka-3.8.0-rc3/kafka_2.13-3.8.0.tgz with
> Java 11 and Scala 2.13 in KRaft mode
> - - Our JVM-based broker Docker image, apache/kafka:3.8.0-rc3
> - Ran all unit tests
> - Ran all integration tests for Connect and MM2
>
>
> A few small, non-blocking notes:
> 1) The release notes categorize KAFKA-16445 [1] as an improvement, but I
> believe it should be listed as a new feature instead.
> 2) The following unit tests failed the first time around, but passed when
> run a second time:
> - (clients) SaslAuthenticatorTest.testMissingUsernameSaslPlain()
> - (core) ProducerIdManagerTest.testUnrecoverableErrors(UNKNOWN_SERVER_ERROR)
> - (core) RemoteLogManagerTest.testCopyQuota(false)
> - (core) SocketServerTest.testClientDisconnectionWithOutstandingReceivesProcessedUntilFailedSend()
> - (core) ZkMigrationIntegrationTest.testMigrateTopicDeletions [7] Type=ZK,
> MetadataVersion=3.7-IV4,Security=PLAINTEXT
>   - This is also not actually a unit test, but an integration test. Looks
> like we haven't classified it correctly?
>
>
> [1] - https://issues.apache.org/jira/browse/KAFKA-16445
>
> Cheers,
>
> Chris
>
> On Tue, Jul 23, 2024 at 8:42 AM Josep Prat 
> wrote:
>
> > Here is the link to the system tests:
> >
> > https://confluent-open-source-kafka-system-test-results.s3-us-west-2.amazonaws.com/3.8/2024-07-22--001.ffbb03b2-61f4-4ebb-ae1f-af5c753682fb--1721733000--confluentinc--3.8--9a2b34b68c/report.html
> >
> > The Quota tests are known to fail in this CI system. Regarding the other
> > tests, they ran successfully in the past but are now timing out.
> >
> > Best,
> >
> > On Tue, Jul 23, 2024 at 12:07 PM Josep Prat  wrote:
> >
> > > Hello Kafka users, developers and client-developers,
> > >
> > > This is the fourth candidate for release of Apache Kafka 3.8.0.
> > >
> > > Some of the major features included in this release are:
> > > * KIP-1028: Docker Official Image for Apache Kafka
> > > * KIP-974: Docker Image for GraalVM based Native Kafka Broker
> > > * KIP-1036: Extend RecordDeserializationException exception
> > > * KIP-1019: Expose method to determine Metric Measurability
> > > * KIP-1004: Enforce tasks.max property in Kafka Connect
> > > * KIP-989: Improved StateStore Iterator metrics for detecting leaks
> > > * KIP-993: Allow restricting files accessed by File and Directory
> > > ConfigProviders
> > > * KIP-924: customizable task assignment for Streams
> > > * KIP-813: Shareable State Stores
> > > * KIP-719: Deprecate Log4J Appender
> > > * KIP-390: Support Compression Level
> > > * KIP-1018: Introduce max remote fetch timeout config for
> > > DelayedRemoteFetch requests
> > > * KIP-1037: Allow WriteTxnMarkers API with Alter Cluster Permission
> > > * KIP-1047: Introduce new org.apache.kafka.tools.api.Decoder to replace
> > > kafka.serializer.Decoder
> > > * KIP-899: Allow producer and consumer clients to rebootstrap
> > >
> > > Release notes for the 3.8.0 release:
> > > https://home.apache.org/~jlprat/kafka-3.8.0-rc3/RELEASE_NOTES.html
> > >
> > > Please download, test and vote by Friday, July 26, 9am PT.
> > >
> > >
> > > Kafka's KEYS file containing PGP keys we use to sign the release:
> > > https://kafka.apache.org/KEYS
> > >
> > > * Release artifacts to be voted upon (source and binary):
> > > https://home.apache.org/~jlprat/kafka-3.8.0-rc3/
> > >
> > > * Docker release artifact to be voted upon:
> > > apache/kafka:3.8.0-rc3
> > > apache/kafka-native:3.8.0-rc3
> > >
> > > * Maven artifacts to be voted upon:
> > > https://repository.apache.org/content/groups/staging/org/apache/kafka/
> > >
> > > * Javadoc:
> > > https://home.apache.org/~jlprat/kafka-3.8.0-rc3/javadoc/
> > >
> > > * Tag to be voted upon (off 3.8 branch) is the

[jira] [Created] (KAFKA-17190) AssignmentsManager gets stuck retrying on deleted topics

2024-07-23 Thread Colin McCabe (Jira)
Colin McCabe created KAFKA-17190:


 Summary: AssignmentsManager gets stuck retrying on deleted topics
 Key: KAFKA-17190
 URL: https://issues.apache.org/jira/browse/KAFKA-17190
 Project: Kafka
  Issue Type: Bug
Reporter: Colin McCabe
Assignee: Colin McCabe






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-13403) KafkaServer crashes when deleting topics due to the race in log deletion

2024-07-23 Thread Arun Mathew (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun Mathew resolved KAFKA-13403.
-
Resolution: Fixed

Raised the patch and got it merged.

> KafkaServer crashes when deleting topics due to the race in log deletion
> 
>
> Key: KAFKA-13403
> URL: https://issues.apache.org/jira/browse/KAFKA-13403
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.4.1
>Reporter: Haruki Okada
>Assignee: Arun Mathew
>Priority: Major
>
> h2. Environment
>  * OS: CentOS Linux release 7.6
>  * Kafka version: 2.4.1
>  * 
>  ** But as far as I checked the code, I think the same phenomenon could happen 
> even on trunk
>  * Kafka log directory: RAID1+0 (i.e. not using JBOD so only single log.dirs 
> is set)
>  * Java version: AdoptOpenJDK 1.8.0_282
> h2. Phenomenon
> When we were in the middle of deleting several topics by `kafka-topics.sh 
> --delete --topic blah-blah`, one broker in our cluster crashed due to the 
> following exception:
>  
> {code:java}
> [2021-10-21 18:19:19,122] ERROR Shutdown broker because all log dirs in 
> /data/kafka have failed (kafka.log.LogManager)
> {code}
>  
>  
> We also found that NoSuchFileException was thrown right before the crash when 
> LogManager tried to delete logs for some partitions.
>  
> {code:java}
> [2021-10-21 18:19:18,849] ERROR Error while deleting log for foo-bar-topic-5 
> in dir /data/kafka (kafka.server.LogDirFailureChannel)
> java.nio.file.NoSuchFileException: 
> /data/kafka/foo-bar-topic-5.df3626d2d9eb41a2aeb0b8d55d7942bd-delete/03877066.timeindex.deleted
> at 
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
> at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
> at 
> sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
> at 
> sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
> at 
> sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
> at java.nio.file.Files.readAttributes(Files.java:1737)
> at java.nio.file.FileTreeWalker.getAttributes(FileTreeWalker.java:219)
> at java.nio.file.FileTreeWalker.visit(FileTreeWalker.java:276)
> at java.nio.file.FileTreeWalker.next(FileTreeWalker.java:372)
> at java.nio.file.Files.walkFileTree(Files.java:2706)
> at java.nio.file.Files.walkFileTree(Files.java:2742)
> at org.apache.kafka.common.utils.Utils.delete(Utils.java:732)
> at kafka.log.Log.$anonfun$delete$2(Log.scala:2036)
> at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> at kafka.log.Log.maybeHandleIOException(Log.scala:2343)
> at kafka.log.Log.delete(Log.scala:2030)
> at kafka.log.LogManager.deleteLogs(LogManager.scala:826)
> at kafka.log.LogManager.$anonfun$deleteLogs$6(LogManager.scala:840)
> at 
> kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:116)
> at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:65)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> So, the log-dir was marked as offline, which ended in a KafkaServer crash 
> because the broker has only a single log-dir.
> h2. Cause
> We also found below logs right before the NoSuchFileException.
>  
> {code:java}
> [2021-10-21 18:18:17,829] INFO Log for partition foo-bar-5 is renamed to 
> /data/kafka/foo-bar-5.df3626d2d9eb41a2aeb0b8d55d7942bd-delete and is 
> scheduled for deletion (kafka.log.LogManager)
> [2021-10-21 18:18:17,900] INFO [Log partition=foo-bar-5, dir=/data/kafka] 
> Found deletable segments with base offsets [3877066] due to retention time 
> 17280ms breach (kafka.log.Log)[2021-10-21 18:18:17,901] INFO [Log 
> partition=foo-bar-5, dir=/data/kafka] Scheduling segments for deletion 
> List(LogSegment(baseOffset=3877066, size=90316366, 
> lastModifiedTime=1634634956000, largestTime=1634634955854)) (kafka.log.Log)
> {code}
> After checking through Kafka co
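
The two INFO lines above capture the race: the whole partition directory is
renamed and scheduled for asynchronous deletion while a retention check
schedules individual segments of the same log for deletion. A minimal sketch
of a deletion routine that tolerates files vanishing mid-walk (an illustration
of the desired behavior, not the actual Kafka fix) might look like this:

{code:java}
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public final class RaceTolerantDeleter {

    // Delete a renamed "-delete" log directory, tolerating entries that a
    // concurrent task (e.g. retention-triggered segment deletion) already
    // removed between listing and unlinking.
    public static void delete(Path logDir) throws IOException {
        try (Stream<Path> entries = Files.walk(logDir)) {
            entries.sorted(Comparator.reverseOrder()) // children before parents
                   .forEach(p -> {
                       try {
                           Files.deleteIfExists(p);   // no-op if already gone
                       } catch (IOException e) {
                           throw new UncheckedIOException(e);
                       }
                   });
        } catch (NoSuchFileException ignored) {
            // The directory itself vanished before the walk started: done.
        } catch (UncheckedIOException e) {
            if (!(e.getCause() instanceof NoSuchFileException)) {
                throw e.getCause();                   // a real I/O failure
            }
            // An entry vanished mid-walk: another deleter won the race.
        }
    }
}
{code}

Whatever the exact fix, the key property is that an already-missing file is
treated as successfully deleted rather than as a fatal log-dir failure.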