Hi Thomas,

I see the IllegalStateException, but as I pasted earlier it is:

java.lang.IllegalStateException: Failed to enqueue `AlterIsr` request with state LeaderAndIsr(leader=1, leaderEpoch=2728, isr=List(1, 2), zkVersion=4719) for partition __consumer_offsets-4
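For reference, this is roughly how I am confirming which partitions are still stuck after the restart (just a sketch using the Java AdminClient; the bootstrap address below is a placeholder for our internal listener, not our real config):

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class UnderReplicatedCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address -- replace with the real listener.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            Set<String> topics = admin.listTopics().names().get();
            Map<String, TopicDescription> descriptions =
                    admin.describeTopics(topics).all().get();

            descriptions.forEach((topic, description) -> {
                for (TopicPartitionInfo p : description.partitions()) {
                    // A partition is under-replicated when its ISR is smaller
                    // than its full replica set.
                    if (p.isr().size() < p.replicas().size()) {
                        System.out.printf("%s-%d leader=%s isr=%s replicas=%s%n",
                                topic, p.partition(),
                                p.leader() == null ? "none" : p.leader().id(),
                                p.isr(), p.replicas());
                    }
                }
            });
        }
    }
}
```

In our case it keeps reporting the same partitions (e.g. __consumer_offsets-4) until the broker holding the partition leader is restarted.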
I upgraded to version 2.8.1 but the issue is not resolved.

Thanks,
Dhirendra.

On Mon, Mar 7, 2022 at 5:45 AM Liam Clarke-Hutchinson <lclar...@redhat.com> wrote:

> Ah, I may have seen this error before. Dhirendra Singh, if you grep your
> logs, you may find an IllegalStateException or two.
>
> https://issues.apache.org/jira/browse/KAFKA-12948
>
> You need to upgrade to 2.7.2 if this is the issue you're hitting.
>
> Kind regards,
>
> Liam Clarke-Hutchinson
>
> On Sun, 6 Mar 2022 at 04:30, Mailbox - Dhirendra Kumar Singh <
> dhirendr...@gmail.com> wrote:
>
> > Let me rephrase my issue.
> > The issue occurs when brokers lose connectivity to the zookeeper server.
> > Connectivity loss can happen for many reasons: zookeeper servers getting
> > bounced, a network glitch, etc.
> >
> > After the brokers reconnect to the zookeeper server I expect the kafka
> > cluster to come back to a stable state by itself, without any manual
> > intervention, but instead a few partitions remain under replicated due
> > to the error I pasted earlier.
> >
> > I feel this is some kind of bug. I am going to file a bug.
> >
> > Thanks,
> >
> > Dhirendra.
> >
> > From: Thomas Cooper <c...@tomcooper.dev>
> > Sent: Friday, March 4, 2022 7:01 PM
> > To: Dhirendra Singh <dhirendr...@gmail.com>
> > Cc: users@kafka.apache.org
> > Subject: Re: Few partitions stuck in under replication
> >
> > Do you roll the controller last?
> >
> > I suspect this is more to do with the way you are rolling the cluster
> > (which I am still not clear on the need for) rather than some kind of
> > bug in Kafka (though that could of course be the case).
> >
> > Tom
> >
> > On 04/03/2022 01:59, Dhirendra Singh wrote:
> >
> > Hi Tom,
> > During the rolling restart we check in the readiness probe that the
> > under-replicated partition count is zero before restarting the next pod
> > in order.
> > This issue never occurred before. It started after we upgraded the kafka
> > version from 2.5.0 to 2.7.1, so I suspect some bug was introduced in a
> > version after 2.5.0.
> >
> > Thanks,
> >
> > Dhirendra.
> >
> > On Thu, Mar 3, 2022 at 11:09 PM Thomas Cooper <c...@tomcooper.dev> wrote:
> >
> > I suspect this nightly rolling will have something to do with your
> > issues. If you are just rolling the stateful set in order, with no
> > dependence on maintaining minISR and other Kafka considerations, you
> > are going to hit issues.
> >
> > If you are running on Kubernetes I would suggest using an operator like
> > Strimzi <https://strimzi.io/>, which will do a lot of the Kafka admin
> > tasks like this for you automatically.
> >
> > Tom
> >
> > On 03/03/2022 16:28, Dhirendra Singh wrote:
> >
> > Hi Tom,
> >
> > Doing the nightly restart is the decision of the cluster admin. I have
> > no control over it.
> > We have an implementation using a stateful set; the restart is triggered
> > by updating an annotation on the pod.
> > The issue is not triggered by the kafka cluster restart but by the
> > zookeeper servers' restart.
> >
> > Thanks,
> >
> > Dhirendra.
> >
> > On Thu, Mar 3, 2022 at 7:19 PM Thomas Cooper <c...@tomcooper.dev> wrote:
> >
> > Hi Dhirendra,
> >
> > Firstly, I am interested in why you are restarting the ZK and Kafka
> > clusters every night.
> >
> > Secondly, how are you doing the restarts? For example, in [Strimzi](
> > https://strimzi.io/), when we roll the Kafka cluster we leave the
> > designated controller broker until last. For each of the other brokers
> > we wait until all the partitions they are leaders for are above their
> > minISR and then we roll the broker. In this way we maintain availability
> > and make sure leadership can move off the rolling broker temporarily.
> >
> > Cheers,
> >
> > Tom Cooper
> >
> > [@tomncooper](https://twitter.com/tomncooper) | https://tomcooper.dev
> >
> > On 03/03/2022 07:38, Dhirendra Singh wrote:
> >
> > > Hi All,
> > >
> > > We have a kafka cluster running in kubernetes. The kafka version we
> > > are using is 2.7.1.
> > > Every night the zookeeper servers and kafka brokers are restarted.
> > > After the nightly restart of the zookeeper servers some partitions
> > > remain stuck in under replication. This happens randomly, but not at
> > > every nightly restart.
> > > Partitions remain under replicated until the kafka broker with the
> > > partition leader is restarted.
> > > For example, partition 4 of the consumer_offsets topic remains under
> > > replicated and we see the following error in the log...
> > >
> > > [2022-02-28 04:01:20,217] WARN [Partition __consumer_offsets-4 broker=1]
> > > Controller failed to update ISR to PendingExpandIsr(isr=Set(1),
> > > newInSyncReplicaId=2) due to unexpected UNKNOWN_SERVER_ERROR. Retrying.
> > > (kafka.cluster.Partition)
> > > [2022-02-28 04:01:20,217] ERROR [broker-1-to-controller] Uncaught error
> > > in request completion: (org.apache.kafka.clients.NetworkClient)
> > > java.lang.IllegalStateException: Failed to enqueue `AlterIsr` request
> > > with state LeaderAndIsr(leader=1, leaderEpoch=2728, isr=List(1, 2),
> > > zkVersion=4719) for partition __consumer_offsets-4
> > > at kafka.cluster.Partition.sendAlterIsrRequest(Partition.scala:1403)
> > > at kafka.cluster.Partition.$anonfun$handleAlterIsrResponse$1(Partition.scala:1438)
> > > at kafka.cluster.Partition.handleAlterIsrResponse(Partition.scala:1417)
> > > at kafka.cluster.Partition.$anonfun$sendAlterIsrRequest$1(Partition.scala:1398)
> > > at kafka.cluster.Partition.$anonfun$sendAlterIsrRequest$1$adapted(Partition.scala:1398)
> > > at kafka.server.AlterIsrManagerImpl.$anonfun$handleAlterIsrResponse$8(AlterIsrManager.scala:166)
> > > at kafka.server.AlterIsrManagerImpl.$anonfun$handleAlterIsrResponse$8$adapted(AlterIsrManager.scala:163)
> > > at scala.collection.immutable.List.foreach(List.scala:333)
> > > at kafka.server.AlterIsrManagerImpl.handleAlterIsrResponse(AlterIsrManager.scala:163)
> > > at kafka.server.AlterIsrManagerImpl.responseHandler$1(AlterIsrManager.scala:94)
> > > at kafka.server.AlterIsrManagerImpl.$anonfun$sendRequest$2(AlterIsrManager.scala:104)
> > > at kafka.server.BrokerToControllerRequestThread.handleResponse(BrokerToControllerChannelManagerImpl.scala:175)
> > > at kafka.server.BrokerToControllerRequestThread.$anonfun$generateRequests$1(BrokerToControllerChannelManagerImpl.scala:158)
> > > at org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:109)
> > > at org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:586)
> > > at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:578)
> > > at kafka.common.InterBrokerSendThread.doWork(InterBrokerSendThread.scala:71)
> > > at kafka.server.BrokerToControllerRequestThread.doWork(BrokerToControllerChannelManagerImpl.scala:183)
> > > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
> > >
> > > Looks like some kind of race condition bug... anyone have any idea?
> > >
> > > Thanks,
> > > Dhirendra
> >
> > --
> >
> > Tom Cooper
> >
> > @tomncooper <https://twitter.com/tomncooper> | tomcooper.dev <https://tomcooper.dev>
> >
> > --
> >
> > Tom Cooper
> >
> > @tomncooper <https://twitter.com/tomncooper> | tomcooper.dev <https://tomcooper.dev>
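PS: Regarding the question about rolling the controller last: today the stateful set just restarts pods in order, but this is roughly how I plan to look up the active controller before the nightly roll so that its broker can be restarted last (again only a sketch with the Java AdminClient; the bootstrap address is a placeholder):

```java
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.Node;

public class FindController {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address -- replace with the real listener.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // DescribeCluster reports the broker currently acting as controller,
            // so the roll can be ordered to restart that broker last.
            Node controller = admin.describeCluster().controller().get();
            System.out.println("Current controller broker id: " + controller.id());
        }
    }
}
```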