Hi Tom,

Doing the nightly restart is the cluster admin's decision; I have no control over it. We run the cluster as a StatefulSet, and the restart is triggered by updating an annotation on the pod. The issue is not triggered by the Kafka cluster restart but by the restart of the ZooKeeper servers.
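To give a concrete picture (the pod name and annotation key below are only illustrative, not the exact ones our admin uses), the nightly job effectively runs something like:

  kubectl annotate pod zookeeper-0 restart-trigger/at="$(date -u +%FT%TZ)" --overwrite

and a controller owned by the cluster admin watches that annotation and bounces the pod.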
Thanks,
Dhirendra.

On Thu, Mar 3, 2022 at 7:19 PM Thomas Cooper <c...@tomcooper.dev> wrote:

> Hi Dhirendra,
>
> Firstly, I am interested in why you are restarting the ZK and Kafka
> cluster every night?
>
> Secondly, how are you doing the restarts? For example, in
> [Strimzi](https://strimzi.io/), when we roll the Kafka cluster we leave the
> designated controller broker until last. For each of the other brokers we
> wait until all the partitions they are leaders for are above their minISR
> and then we roll the broker. In this way we maintain availability and make
> sure leadership can move off the rolling broker temporarily.
>
> Cheers,
>
> Tom Cooper
>
> [@tomncooper](https://twitter.com/tomncooper) | https://tomcooper.dev
>
> On 03/03/2022 07:38, Dhirendra Singh wrote:
>
> > Hi All,
> >
> > We have a Kafka cluster running in Kubernetes. The Kafka version we are
> > using is 2.7.1.
> > Every night the ZooKeeper servers and Kafka brokers are restarted.
> > After the nightly restart of the ZooKeeper servers some partitions remain
> > stuck in under replication. This happens randomly, but not at every
> > nightly restart.
> > Partitions remain under replicated until the Kafka broker with the
> > partition leader is restarted.
> > For example, partition 4 of the __consumer_offsets topic remains under
> > replicated and we see the following error in the log...
> >
> > [2022-02-28 04:01:20,217] WARN [Partition __consumer_offsets-4 broker=1]
> > Controller failed to update ISR to PendingExpandIsr(isr=Set(1),
> > newInSyncReplicaId=2) due to unexpected UNKNOWN_SERVER_ERROR. Retrying.
> > (kafka.cluster.Partition)
> > [2022-02-28 04:01:20,217] ERROR [broker-1-to-controller] Uncaught error in
> > request completion: (org.apache.kafka.clients.NetworkClient)
> > java.lang.IllegalStateException: Failed to enqueue `AlterIsr` request with
> > state LeaderAndIsr(leader=1, leaderEpoch=2728, isr=List(1, 2),
> > zkVersion=4719) for partition __consumer_offsets-4
> >   at kafka.cluster.Partition.sendAlterIsrRequest(Partition.scala:1403)
> >   at kafka.cluster.Partition.$anonfun$handleAlterIsrResponse$1(Partition.scala:1438)
> >   at kafka.cluster.Partition.handleAlterIsrResponse(Partition.scala:1417)
> >   at kafka.cluster.Partition.$anonfun$sendAlterIsrRequest$1(Partition.scala:1398)
> >   at kafka.cluster.Partition.$anonfun$sendAlterIsrRequest$1$adapted(Partition.scala:1398)
> >   at kafka.server.AlterIsrManagerImpl.$anonfun$handleAlterIsrResponse$8(AlterIsrManager.scala:166)
> >   at kafka.server.AlterIsrManagerImpl.$anonfun$handleAlterIsrResponse$8$adapted(AlterIsrManager.scala:163)
> >   at scala.collection.immutable.List.foreach(List.scala:333)
> >   at kafka.server.AlterIsrManagerImpl.handleAlterIsrResponse(AlterIsrManager.scala:163)
> >   at kafka.server.AlterIsrManagerImpl.responseHandler$1(AlterIsrManager.scala:94)
> >   at kafka.server.AlterIsrManagerImpl.$anonfun$sendRequest$2(AlterIsrManager.scala:104)
> >   at kafka.server.BrokerToControllerRequestThread.handleResponse(BrokerToControllerChannelManagerImpl.scala:175)
> >   at kafka.server.BrokerToControllerRequestThread.$anonfun$generateRequests$1(BrokerToControllerChannelManagerImpl.scala:158)
> >   at org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:109)
> >   at org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:586)
> >   at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:578)
> >   at kafka.common.InterBrokerSendThread.doWork(InterBrokerSendThread.scala:71)
> >   at kafka.server.BrokerToControllerRequestThread.doWork(BrokerToControllerChannelManagerImpl.scala:183)
> >   at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
> >
> > Looks like some kind of race condition bug... does anyone have any idea?
> >
> > Thanks,
> > Dhirendra
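P.S. In case it helps anyone reproduce this: the stuck partitions are visible with the stock tooling, for example (the bootstrap address here is just illustrative for our cluster):

  bin/kafka-topics.sh --bootstrap-server kafka-0:9092 --describe --under-replicated-partitions

__consumer_offsets-4 stays in that output until the broker holding the partition leader is restarted.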