Hello,

We are running Kafka v3.5.1 in KRaft mode across two datacenters (connected via 10 Gb/s dark fiber):

DC 1: 3 nodes, controller/broker
DC 2: 2 nodes, controller/broker
DC 2: 1 node, broker only

At exactly the same time, 21:01:00 (CEST), the cluster became unstable and no producer / consumer could access it.

Every node has its own node.id and broker.rack set, e.g. on node 1:

 grep -E  '(id|rack)' /etc/kafka/server.properties

broker.rack=0
node.id=1

broker.rack=0 -> DC1, broker.rack=1 -> DC2
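
For reference, this is roughly how we sanity-check the quorum from one of the nodes (the client.properties with our SASL_SSL client settings is just a placeholder, it is not shown above):

 kafka-metadata-quorum.sh --bootstrap-server qh-a08-kafka-01.example.com:9092 \
     --command-config /etc/kafka/client.properties describe --replication

It should list the leader, the five voters (IDs 1-5) and the broker-only node as an observer, with log end offset and fetch lag per node.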

We have the exact same setup on our test system, and it runs without any issues. The only differences are the missing dark fiber and different hostnames / certs. The rest is identical, because we use Puppet for config management.
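
To rule out config drift between prod and test, we roughly compare the rendered configs like this (file names are placeholders; the sed only masks the keys that are host-specific by design):

 normalise() {
   grep -Ev '^(#|$)' "$1" \
     | sed -E 's/^(node\.id|broker\.rack|advertised\.listeners)=.*/\1=<host-specific>/' \
     | sort
 }
 diff <(normalise prod-server.properties) <(normalise test-server.properties)

Anything else that is genuinely host-specific (e.g. the hostnames in controller.quorum.voters) can be added to the mask as needed.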

The logs look like this:

DC 1, Node 1:

Jan 28 21:01:05 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:05,135] INFO [RaftManager id=1] Completed transition to Unattached(epoch=1494, voters=[1, 2, 3, 4, 5], electionTimeoutMs=1638) from FollowerState(fetchTimeoutMs=2000, epoch=1493, leaderId=5, voters=[1, 2, 3, 4, 5], highWatermark=Optional[LogOffsetMetadata(offset=20072183, me>
Jan 28 21:01:05 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:05,137] INFO [RaftManager id=1] Vote request VoteRequestData(clusterId='Rnpnd4EcRBeWo8vUrWlOIQ', topics=[TopicData(topicName='__cluster_metadata', partitions=[PartitionData(partitionIndex=0, candidateEpoch=1494, candidateId=2, lastOffsetEpoch=1493, lastOffset=20072146)])]) w>
Jan 28 21:01:05 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:05,137] INFO [QuorumController id=1] In the new epoch 1494, the leader is (none). (org.apache.kafka.controller.QuorumController)
Jan 28 21:01:05 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:05,257] INFO [RaftManager id=1] Completed transition to Unattached(epoch=1495, voters=[1, 2, 3, 4, 5], electionTimeoutMs=1511) from Unattached(epoch=1494, voters=[1, 2, 3, 4, 5], electionTimeoutMs=1638) (org.apache.kafka.raft.QuorumState)
Jan 28 21:01:05 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:05,258] INFO [RaftManager id=1] Vote request VoteRequestData(clusterId='Rnpnd4EcRBeWo8vUrWlOIQ', topics=[TopicData(topicName='__cluster_metadata', partitions=[PartitionData(partitionIndex=0, candidateEpoch=1495, candidateId=2, lastOffsetEpoch=1493, lastOffset=20072146)])]) w>
Jan 28 21:01:05 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:05,258] INFO [QuorumController id=1] In the new epoch 1495, the leader is (none). (org.apache.kafka.controller.QuorumController)
Jan 28 21:01:05 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:05,378] INFO [RaftManager id=1] Completed transition to Unattached(epoch=1496, voters=[1, 2, 3, 4, 5], electionTimeoutMs=1391) from Unattached(epoch=1495, voters=[1, 2, 3, 4, 5], electionTimeoutMs=1511) (org.apache.kafka.raft.QuorumState)
Jan 28 21:01:05 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:05,378] INFO [RaftManager id=1] Vote request VoteRequestData(clusterId='Rnpnd4EcRBeWo8vUrWlOIQ', topics=[TopicData(topicName='__cluster_metadata', partitions=[PartitionData(partitionIndex=0, candidateEpoch=1496, candidateId=2, lastOffsetEpoch=1493, lastOffset=20072146)])]) w>
Jan 28 21:01:05 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:05,378] INFO [QuorumController id=1] In the new epoch 1496, the leader is (none). (org.apache.kafka.controller.QuorumController)
Jan 28 21:01:05 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:05,902] INFO [RaftManager id=1] Completed transition to Unattached(epoch=1497, voters=[1, 2, 3, 4, 5], electionTimeoutMs=870) from Unattached(epoch=1496, voters=[1, 2, 3, 4, 5], electionTimeoutMs=1391) (org.apache.kafka.raft.QuorumState)
Jan 28 21:01:05 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:05,902] INFO [RaftManager id=1] Vote request VoteRequestData(clusterId='Rnpnd4EcRBeWo8vUrWlOIQ', topics=[TopicData(topicName='__cluster_metadata', partitions=[PartitionData(partitionIndex=0, candidateEpoch=1497, candidateId=2, lastOffsetEpoch=1493, lastOffset=20072146)])]) w>
Jan 28 21:01:05 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:05,902] INFO [QuorumController id=1] In the new epoch 1497, the leader is (none). (org.apache.kafka.controller.QuorumController)
Jan 28 21:01:06 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:06,198] INFO [BrokerToControllerChannelManager id=1 name=heartbeat] Client requested disconnect from node 5 (org.apache.kafka.clients.NetworkClient)
Jan 28 21:01:06 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:06,349] INFO [RaftManager id=1] Completed transition to Unattached(epoch=1498, voters=[1, 2, 3, 4, 5], electionTimeoutMs=422) from Unattached(epoch=1497, voters=[1, 2, 3, 4, 5], electionTimeoutMs=870) (org.apache.kafka.raft.QuorumState)
Jan 28 21:01:06 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:06,350] INFO [QuorumController id=1] In the new epoch 1498, the leader is (none). (org.apache.kafka.controller.QuorumController)
Jan 28 21:01:06 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:06,357] INFO [RaftManager id=1] Completed transition to Voted(epoch=1498, votedId=3, voters=[1, 2, 3, 4, 5], electionTimeoutMs=1456) from Unattached(epoch=1498, voters=[1, 2, 3, 4, 5], electionTimeoutMs=422) (org.apache.kafka.raft.QuorumState)
Jan 28 21:01:06 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:06,357] INFO [RaftManager id=1] Vote request VoteRequestData(clusterId='Rnpnd4EcRBeWo8vUrWlOIQ', topics=[TopicData(topicName='__cluster_metadata', partitions=[PartitionData(partitionIndex=0, candidateEpoch=1498, candidateId=3, lastOffsetEpoch=1493, lastOffset=20072184)])]) w>
Jan 28 21:01:06 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:06,388] INFO [RaftManager id=1] Completed transition to FollowerState(fetchTimeoutMs=2000, epoch=1498, leaderId=3, voters=[1, 2, 3, 4, 5], highWatermark=Optional[LogOffsetMetadata(offset=20072183, metadata=Optional.empty)], fetchingSnapshot=Optional.empty) from Voted(epoch=1>
Jan 28 21:01:06 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:06,389] INFO [QuorumController id=1] In the new epoch 1498, the leader is 3. (org.apache.kafka.controller.QuorumController)
Jan 28 21:01:06 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:06,401] INFO [broker-1-to-controller-heartbeat-channel-manager]: Recorded new controller, from now on will use node qh-a08-kafka-03.example.com:9093 (id: 3 rack: null) (kafka.server.BrokerToControllerRequestThread)
Jan 28 21:01:06 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:06,428] INFO [BrokerToControllerChannelManager id=1 name=heartbeat] Client requested disconnect from node 3 (org.apache.kafka.clients.NetworkClient)
Jan 28 21:01:06 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:06,428] INFO [broker-1-to-controller-heartbeat-channel-manager]: Recorded new controller, from now on will use node qh-a08-kafka-03.example.com:9093 (id: 3 rack: null) (kafka.server.BrokerToControllerRequestThread)
Jan 28 21:01:06 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:06,479] INFO [broker-1-to-controller-heartbeat-channel-manager]: Recorded new controller, from now on will use node qh-a08-kafka-03.example.com:9093 (id: 3 rack: null) (kafka.server.BrokerToControllerRequestThread)
Jan 28 21:01:18 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:18,479] INFO [RaftManager id=1] Become candidate due to fetch timeout (org.apache.kafka.raft.KafkaRaftClient)
Jan 28 21:01:18 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:18,486] INFO [RaftManager id=1] Completed transition to CandidateState(localId=1, epoch=1499, retries=1, voteStates={1=GRANTED, 2=UNRECORDED, 3=UNRECORDED, 4=UNRECORDED, 5=UNRECORDED}, highWatermark=Optional[LogOffsetMetadata(offset=20072204, metadata=Optional.empty)], elect>
Jan 28 21:01:18 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:18,487] INFO [QuorumController id=1] In the new epoch 1499, the leader is (none). (org.apache.kafka.controller.QuorumController)
Jan 28 21:01:18 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:18,488] INFO [RaftManager id=1] Disconnecting from node 3 due to request timeout. (org.apache.kafka.clients.NetworkClient)
Jan 28 21:01:18 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:18,489] INFO [RaftManager id=1] Cancelled in-flight FETCH request with correlation id 428513 due to node 3 being disconnected (elapsed time since creation: 2008ms, elapsed time since send: 2007ms, request timeout: 2000ms) (org.apache.kafka.clients.NetworkClient)
Jan 28 21:01:18 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:18,522] INFO [RaftManager id=1] Vote request VoteRequestData(clusterId='Rnpnd4EcRBeWo8vUrWlOIQ', topics=[TopicData(topicName='__cluster_metadata', partitions=[PartitionData(partitionIndex=0, candidateEpoch=1499, candidateId=4, lastOffsetEpoch=1498, lastOffset=20072204)])]) w>
Jan 28 21:01:18 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:18,535] INFO [RaftManager id=1] Vote request VoteRequestData(clusterId='Rnpnd4EcRBeWo8vUrWlOIQ', topics=[TopicData(topicName='__cluster_metadata', partitions=[PartitionData(partitionIndex=0, candidateEpoch=1499, candidateId=5, lastOffsetEpoch=1498, lastOffset=20072204)])]) w>
Jan 28 21:01:19 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:19,734] INFO [RaftManager id=1] Completed transition to Unattached(epoch=1500, voters=[1, 2, 3, 4, 5], electionTimeoutMs=445) from CandidateState(localId=1, epoch=1499, retries=1, voteStates={1=GRANTED, 2=UNRECORDED, 3=UNRECORDED, 4=REJECTED, 5=REJECTED}, highWatermark=Optio>
Jan 28 21:01:19 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:19,734] INFO [QuorumController id=1] In the new epoch 1500, the leader is (none). (org.apache.kafka.controller.QuorumController)
Jan 28 21:01:19 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:19,739] INFO [RaftManager id=1] Completed transition to Voted(epoch=1500, votedId=5, voters=[1, 2, 3, 4, 5], electionTimeoutMs=1624) from Unattached(epoch=1500, voters=[1, 2, 3, 4, 5], electionTimeoutMs=445) (org.apache.kafka.raft.QuorumState)
Jan 28 21:01:19 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:19,739] INFO [RaftManager id=1] Vote request VoteRequestData(clusterId='Rnpnd4EcRBeWo8vUrWlOIQ', topics=[TopicData(topicName='__cluster_metadata', partitions=[PartitionData(partitionIndex=0, candidateEpoch=1500, candidateId=5, lastOffsetEpoch=1498, lastOffset=20072204)])]) w>
Jan 28 21:01:19 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:19,773] INFO [RaftManager id=1] Completed transition to FollowerState(fetchTimeoutMs=2000, epoch=1500, leaderId=5, voters=[1, 2, 3, 4, 5], highWatermark=Optional[LogOffsetMetadata(offset=20072204, metadata=Optional.empty)], fetchingSnapshot=Optional.empty) from Voted(epoch=1>
Jan 28 21:01:19 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:19,774] INFO [QuorumController id=1] In the new epoch 1500, the leader is 5. (org.apache.kafka.controller.QuorumController)
Jan 28 21:01:20 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:20,781] INFO [RaftManager id=1] Disconnecting from node 2 due to request timeout. (org.apache.kafka.clients.NetworkClient)
Jan 28 21:01:20 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:20,781] INFO [RaftManager id=1] Cancelled in-flight VOTE request with correlation id 428514 due to node 2 being disconnected (elapsed time since creation: 2294ms, elapsed time since send: 2254ms, request timeout: 2000ms) (org.apache.kafka.clients.NetworkClient)
Jan 28 21:01:20 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:20,783] INFO [RaftManager id=1] Disconnecting from node 3 due to request timeout. (org.apache.kafka.clients.NetworkClient)
Jan 28 21:01:20 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:20,783] INFO [RaftManager id=1] Cancelled in-flight VOTE request with correlation id 428521 due to node 3 being disconnected (elapsed time since creation: 2238ms, elapsed time since send: 2215ms, request timeout: 2000ms) (org.apache.kafka.clients.NetworkClient)
Jan 28 21:01:21 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:21,017] INFO [BrokerToControllerChannelManager id=1 name=heartbeat] Disconnecting from node 3 due to request timeout. (org.apache.kafka.clients.NetworkClient)
Jan 28 21:01:21 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:21,017] INFO [BrokerToControllerChannelManager id=1 name=heartbeat] Cancelled in-flight BROKER_HEARTBEAT request with correlation id 107067 due to node 3 being disconnected (elapsed time since creation: 4501ms, elapsed time since send: 4501ms, request timeout: 4500ms) (org.a>
Jan 28 21:01:21 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:21,017] INFO [broker-1-to-controller-heartbeat-channel-manager]: Recorded new controller, from now on will use node fc-r01-kafka-02.example.com:9093 (id: 5 rack: null) (kafka.server.BrokerToControllerRequestThread)
Jan 28 21:01:21 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:21,017] INFO [BrokerLifecycleManager id=1] Unable to send a heartbeat because the RPC got timed out before it could be sent. (kafka.server.BrokerLifecycleManager)
Jan 28 21:01:28 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:28,810] INFO [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=0] Partition company21_pc21_transaction-2 has an older epoch (89) than the current leader. Will await the new LeaderAndIsr state before resuming fetching. (kafka.server.ReplicaFetcherThread)
Jan 28 21:01:28 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:28,811] WARN [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=0] Partition company21_pc21_transaction-2 marked as failed (kafka.server.ReplicaFetcherThread)
Jan 28 21:01:28 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:28,842] INFO [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=0] Partition kafka_proxy_test2-0 has an older epoch (86) than the current leader. Will await the new LeaderAndIsr state before resuming fetching. (kafka.server.ReplicaFetcherThread)
Jan 28 21:01:28 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:28,842] WARN [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=0] Partition kafka_proxy_test2-0 marked as failed (kafka.server.ReplicaFetcherThread)

...
Jan 28 21:01:28 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:28,881] INFO [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=0] Partition chargebacks-3 has an older epoch (86) than the current leader. Will await the new LeaderAndIsr state before resuming fetching. (kafka.server.ReplicaFetcherThread)
Jan 28 21:01:28 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:28,881] WARN [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=0] Partition chargebacks-3 marked as failed (kafka.server.ReplicaFetcherThread)
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:29,300] INFO [ReplicaFetcherManager on broker 1] Removed fetcher for partitions Set(mm2-offsets.FC-R02.internal-24, __consumer_offsets-48, __consumer_offsets-13, kafka_proxy_test1-0, mm2-configs.FC-R02.internal-0, __consumer_offsets-20, mm2-status.FC-R02.internal-1, __consum>
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:29,371] INFO [ReplicaFetcher replicaId=1, leaderId=4, fetcherId=0] Partition company21_pc21_transaction-9 has an older epoch (89) than the current leader. Will await the new LeaderAndIsr state before resuming fetching. (kafka.server.ReplicaFetcherThread)
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:29,372] WARN [ReplicaFetcher replicaId=1, leaderId=4, fetcherId=0] Partition company21_pc21_transaction-9 marked as failed (kafka.server.ReplicaFetcherThread)
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:29,439] INFO [ReplicaFetcherManager on broker 1] Removed fetcher for partitions Set(blacklist_transactions-9, kafka_proxy_test2-0, mm2-offsets.FC-R02.internal-18, mm2-offsets.FC-R02.internal-23, company21_pc21_transaction-2, chargebacks-1, company21_pc21_transaction-0, p>
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:29,443] INFO [ReplicaFetcherManager on broker 1] Added fetcher to broker 2 for partitions HashMap(blacklist_transactions-9 -> InitialFetchState(Some(RxifGFHPQsGMWP5Sq_rSFg),BrokerEndPoint(id=2, host=qh-a08-kafka-02.example.com:9092),100,1748933), mm2-offsets.FC-R02.internal-2>
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:29,444] INFO [ReplicaFetcherManager on broker 1] Added fetcher to broker 5 for partitions HashMap(company21_pc21_transaction-2 -> InitialFetchState(Some(oruOHN6SSLOsuxgB4YGuyw),BrokerEndPoint(id=5, host=fc-r01-kafka-02.example.com:9092),90,6704572), kafka_proxy_test2-0 -> I>
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:29,444] INFO [ReplicaFetcherManager on broker 1] Added fetcher to broker 4 for partitions HashMap(chargebacks-1 -> InitialFetchState(Some(Df7E7Y3-TxKjd5QIBB2mgg),BrokerEndPoint(id=4, host=fc-r01-kafka-01.example.com:9092),73,0), company21_pc21_transaction-0 -> InitialFetchS>
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:29,446] INFO [ReplicaFetcherThread-0-3]: Shutting down (kafka.server.ReplicaFetcherThread)
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:29,447] INFO [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Client requested connection close from node 3 (org.apache.kafka.clients.NetworkClient)
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:29,448] INFO [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Cancelled in-flight FETCH request with correlation id 191052 due to node 3 being disconnected (elapsed time since creation: 5306ms, elapsed time since send: 5306ms, request timeout: 30000ms) (org.apache.kafka>
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:29,448] INFO [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Error sending fetch request (sessionId=359115694, epoch=191052) to node 3: (org.apache.kafka.clients.FetchSessionHandler)
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]: java.io.IOException: Client was shutdown before response was read
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]:         at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:108)
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]:         at kafka.server.BrokerBlockingSender.sendRequest(BrokerBlockingSender.scala:113)
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]:         at kafka.server.RemoteLeaderEndPoint.fetch(RemoteLeaderEndPoint.scala:79)
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]:         at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:316)
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]:         at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:130)
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]:         at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:129)
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]:         at scala.Option.foreach(Option.scala:437)
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]:         at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]:         at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:112)
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]:         at kafka.server.ReplicaFetcherThread.doWork(ReplicaFetcherThread.scala:98)
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]:         at org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:127)
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:29,450] INFO [ReplicaFetcherThread-0-3]: Stopped (kafka.server.ReplicaFetcherThread)
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:29,450] INFO [ReplicaFetcherThread-0-3]: Shutdown completed (kafka.server.ReplicaFetcherThread)
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:29,454] INFO [GroupCoordinator 1]: Elected as the group coordinator for partition 48 in epoch 50 (kafka.coordinator.group.GroupCoordinator)
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:29,454] INFO [GroupMetadataManager brokerId=1] Scheduling loading of offsets and group metadata from __consumer_offsets-48 for epoch 50 (kafka.coordinator.group.GroupMetadataManager)
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:29,455] INFO [GroupCoordinator 1]: Elected as the group coordinator for partition 13 in epoch 50 (kafka.coordinator.group.GroupCoordinator)
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:29,455] INFO [GroupMetadataManager brokerId=1] Scheduling loading of offsets and group metadata from __consumer_offsets-13 for epoch 50 (kafka.coordinator.group.GroupMetadataManager)
Jan 28 21:01:29 qh-a08-kafka-01 kafka[1936210]: [2024-01-28 21:01:29,455] INFO [GroupCoordinator 1]: Elected as the group coordinator for partition 30 in epoch 63 (kafka.coordinator.group.GroupCoordinator)


...

After ~10-20 seconds, everything is fine again.
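
The fetchTimeoutMs=2000 and the 2000 ms request timeouts in the log should be the KRaft defaults; we have not overridden any of the quorum timeouts, i.e. (values are the Kafka defaults as far as I can tell, not something we set explicitly):

 controller.quorum.fetch.timeout.ms=2000
 controller.quorum.election.timeout.ms=1000
 controller.quorum.request.timeout.ms=2000

We are wondering whether raising controller.quorum.fetch.timeout.ms would help against short latency spikes on the dark fiber, or whether that would just hide the real problem.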

Before we switched to this new cluster, we had MirrorMaker 2 (mm2) configured to sync everything from the old 3.5.1 cluster (still ZooKeeper-based) to the new KRaft-enabled one.

The whole config for the combined broker / controller nodes is:

===============================
advertised.listeners=INTERNAL://qh-a08-kafka-01.example.com:9092,CLIENT://:9095,EXTERNAL://qh-a08-kafka-01.example.com:63796
allow.everyone.if.no.acl.found=true
authorizer.class.name=org.apache.kafka.metadata.authorizer.StandardAuthorizer
auto.create.topics.enable=false
broker.rack=0
controller.listener.names=CONTROLLER
controller.quorum.voters=1...@qh-a08-kafka-01.example.com:9093,2...@qh-a08-kafka-02.example.com:9093,3...@qh-a08-kafka-03.example.com:9093,4...@fc-r01-kafka-01.example.com:9093,5...@fc-r01-kafka-02.example.com:9093
default.replication.factor=3
early.start.listeners=CONTROLLER
inter.broker.listener.name=INTERNAL
listener.name.controller.ssl.client.auth=required
listener.security.protocol.map=INTERNAL:SASL_SSL,CLIENT:SASL_SSL,CONTROLLER:SSL,EXTERNAL:SASL_SSL
listeners=INTERNAL://:9092,CLIENT://:9095,CONTROLLER://:9093, EXTERNAL://:9094
log.cleanup.policy=delete
log.dirs=/data/kafka/
log.retention.check.interval.ms=300000
log.retention.hours=24
log.segment.bytes=1073741824
min.insync.replicas=2
node.id=1
num.io.threads=8
num.network.threads=3
num.partitions=4
num.recovery.threads.per.data.dir=3
offsets.topic.replication.factor=2
process.roles=broker,controller
sasl.enabled.mechanisms=PLAIN,SASL_SSL
sasl.mechanism.inter.broker.protocol=PLAIN
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
socket.send.buffer.bytes=102400
ssl.cipher.suites=TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384,TLS_RSA_WITH_AES_256_CBC_SHA256,TLS_ECDH_ECDSA_WITH_AES_256_CBC_SHA384,TLS_ECDH_RSA_WITH_AES_256_CBC_SHA384,TLS_DHE_RSA_WITH_AES_256_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDH_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDH_RSA_WITH_AES_128_CBC_SHA256,TLS_DHE_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDH_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDH_RSA_WITH_AES_256_GCM_SHA384,TLS_DHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDH_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDH_RSA_WITH_AES_128_GCM_SHA256,TLS_DHE_RSA_WITH_AES_128_GCM_SHA256
ssl.enabled.protocols=TLSv1.2
ssl.key.password=KnezAhKPNKn-53f.99unuuCp,EwfXq
ssl.keystore.location=/etc/ssl/private/kafka_example_chain.crt
ssl.keystore.type=PEM
ssl.truststore.type=PEM
ssl.truststore.location=/etc/ssl/private/kafka_example_chain.crt
super.users=User:CN=*.example.com
transaction.state.log.min.isr=2
transaction.state.log.replication.factor=2
============================================

The config for the broker-only node looks like this:

============================================

advertised.listeners=INTERNAL://fc-r01-kafka-03.example.com:9092,CLIENT://:9095,EXTERNAL://fc-r01-kafka-03.example.com:63796
allow.everyone.if.no.acl.found=true
authorizer.class.name=org.apache.kafka.metadata.authorizer.StandardAuthorizer
auto.create.topics.enable=false
broker.rack=1
controller.listener.names=CONTROLLER
controller.quorum.voters=1...@qh-a08-kafka-01.example.com:9093,2...@qh-a08-kafka-02.example.com:9093,3...@qh-a08-kafka-03.example.com:9093,4...@fc-r01-kafka-01.example.com:9093,5...@fc-r01-kafka-02.example.com:9093
default.replication.factor=3
inter.broker.listener.name=INTERNAL
listener.name.controller.ssl.client.auth=required
listener.security.protocol.map=INTERNAL:SASL_SSL,CLIENT:SASL_SSL,CONTROLLER:SSL,EXTERNAL:SASL_SSL
listeners=INTERNAL://:9092,CLIENT://:9095,EXTERNAL://:9094
log.cleanup.policy=delete
log.dirs=/data/kafka/
log.retention.check.interval.ms=300000
log.retention.hours=24
log.segment.bytes=1073741824
min.insync.replicas=2
node.id=6
num.io.threads=8
num.network.threads=3
num.partitions=4
num.recovery.threads.per.data.dir=3
offsets.topic.replication.factor=2
process.roles=broker
sasl.enabled.mechanisms=PLAIN,SASL_SSL
sasl.mechanism.inter.broker.protocol=PLAIN
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
socket.send.buffer.bytes=102400
ssl.cipher.suites=TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384,TLS_RSA_WITH_AES_256_CBC_SHA256,TLS_ECDH_ECDSA_WITH_AES_256_CBC_SHA384,TLS_ECDH_RSA_WITH_AES_256_CBC_SHA384,TLS_DHE_RSA_WITH_AES_256_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDH_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDH_RSA_WITH_AES_128_CBC_SHA256,TLS_DHE_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDH_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDH_RSA_WITH_AES_256_GCM_SHA384,TLS_DHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDH_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDH_RSA_WITH_AES_128_GCM_SHA256,TLS_DHE_RSA_WITH_AES_128_GCM_SHA256
ssl.enabled.protocols=TLSv1.2
ssl.key.password=KnezAhKPNKn-53f.99unuuCp,EwfXq
ssl.keystore.location=/etc/ssl/private/kafka_example_chain.crt
ssl.keystore.type=PEM
ssl.truststore.type=PEM
super.users=User:CN=*.example.com
transaction.state.log.min.isr=2
transaction.state.log.replication.factor=2
============================================


Any suggestions as to what the issue could be?

cu denny

