Hi Dhirendra,

Firstly, why are you restarting the ZooKeeper and Kafka clusters every night?

Secondly, how are you doing the restarts? For example, in 
[Strimzi](https://strimzi.io/), when we roll the Kafka cluster we leave the 
broker that is currently the designated controller until last. For each of the 
other brokers we wait until all the partitions it leads are above their min ISR 
(min.insync.replicas) and only then do we roll that broker. In this way we 
maintain availability and make sure leadership can move off the rolling broker 
temporarily (a rough sketch of that check follows below).
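
In case it is useful, here is a minimal sketch of that kind of pre-roll check 
using the Kafka `Admin` client. To be clear, this is not Strimzi's actual 
implementation; the class name, the `safeToRoll` helper, the broker id and the 
bootstrap address are placeholders for illustration only:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class RollCheck {

    // Returns true if no partition led by brokerId would drop below its topic's
    // min.insync.replicas once that broker is taken down for a restart.
    static boolean safeToRoll(Admin admin, int brokerId) throws Exception {
        Set<String> topics = admin.listTopics().names().get();
        Map<String, TopicDescription> descriptions = admin.describeTopics(topics).all().get();

        for (TopicDescription topic : descriptions.values()) {
            // Look up the effective min.insync.replicas for this topic.
            ConfigResource resource = new ConfigResource(ConfigResource.Type.TOPIC, topic.name());
            Config config = admin.describeConfigs(Collections.singleton(resource))
                    .all().get().get(resource);
            int minIsr = Integer.parseInt(config.get("min.insync.replicas").value());

            for (TopicPartitionInfo partition : topic.partitions()) {
                boolean leads = partition.leader() != null && partition.leader().id() == brokerId;
                // If this broker leads the partition, losing it must still leave the
                // ISR at or above min ISR, otherwise acks=all producers will stall.
                if (leads && partition.isr().size() - 1 < minIsr) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address and broker id.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            System.out.println("Safe to roll broker 1: " + safeToRoll(admin, 1));
        }
    }
}
```

The Strimzi operator does this kind of check (plus the controller-last 
ordering) for you, but the same idea can be scripted around whatever is 
driving your nightly restarts.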

Cheers,

Tom Cooper

[@tomncooper](https://twitter.com/tomncooper) | https://tomcooper.dev

On 03/03/2022 07:38, Dhirendra Singh wrote:

> Hi All,
>
> We have a Kafka cluster running in Kubernetes. The Kafka version we are using
> is 2.7.1.
> Every night the ZooKeeper servers and Kafka brokers are restarted.
> After the nightly restart of the ZooKeeper servers some partitions remain
> stuck under-replicated. This happens randomly, but not at every nightly
> restart.
> The partitions remain under-replicated until the Kafka broker hosting the
> partition leader is restarted.
> For example, partition 4 of the __consumer_offsets topic remained
> under-replicated and we see the following error in the log...
>
> [2022-02-28 04:01:20,217] WARN [Partition __consumer_offsets-4 broker=1] Controller failed to update ISR to PendingExpandIsr(isr=Set(1), newInSyncReplicaId=2) due to unexpected UNKNOWN_SERVER_ERROR. Retrying. (kafka.cluster.Partition)
> [2022-02-28 04:01:20,217] ERROR [broker-1-to-controller] Uncaught error in request completion: (org.apache.kafka.clients.NetworkClient)
> java.lang.IllegalStateException: Failed to enqueue `AlterIsr` request with state LeaderAndIsr(leader=1, leaderEpoch=2728, isr=List(1, 2), zkVersion=4719) for partition __consumer_offsets-4
>     at kafka.cluster.Partition.sendAlterIsrRequest(Partition.scala:1403)
>     at kafka.cluster.Partition.$anonfun$handleAlterIsrResponse$1(Partition.scala:1438)
>     at kafka.cluster.Partition.handleAlterIsrResponse(Partition.scala:1417)
>     at kafka.cluster.Partition.$anonfun$sendAlterIsrRequest$1(Partition.scala:1398)
>     at kafka.cluster.Partition.$anonfun$sendAlterIsrRequest$1$adapted(Partition.scala:1398)
>     at kafka.server.AlterIsrManagerImpl.$anonfun$handleAlterIsrResponse$8(AlterIsrManager.scala:166)
>     at kafka.server.AlterIsrManagerImpl.$anonfun$handleAlterIsrResponse$8$adapted(AlterIsrManager.scala:163)
>     at scala.collection.immutable.List.foreach(List.scala:333)
>     at kafka.server.AlterIsrManagerImpl.handleAlterIsrResponse(AlterIsrManager.scala:163)
>     at kafka.server.AlterIsrManagerImpl.responseHandler$1(AlterIsrManager.scala:94)
>     at kafka.server.AlterIsrManagerImpl.$anonfun$sendRequest$2(AlterIsrManager.scala:104)
>     at kafka.server.BrokerToControllerRequestThread.handleResponse(BrokerToControllerChannelManagerImpl.scala:175)
>     at kafka.server.BrokerToControllerRequestThread.$anonfun$generateRequests$1(BrokerToControllerChannelManagerImpl.scala:158)
>     at org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:109)
>     at org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:586)
>     at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:578)
>     at kafka.common.InterBrokerSendThread.doWork(InterBrokerSendThread.scala:71)
>     at kafka.server.BrokerToControllerRequestThread.doWork(BrokerToControllerChannelManagerImpl.scala:183)
>     at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
>
> Looks like some kind of race condition bug... does anyone have any idea?
>
> Thanks,
> Dhirendra
