[ https://issues.apache.org/jira/browse/KAFKA-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17085702#comment-17085702 ]
Sigurd Sandve commented on KAFKA-9839:
--------------------------------------

We experienced the same issue; adding some logs in case they are helpful. After this happened, the server got stuck in a loop for an hour in which the ISR never got back in sync, and it was not fixed until the server was hard-restarted.

{code:java}
Apr 16 10:01:11 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 10:01:11,527] WARN Client session timed out, have not heard from server in 6981ms for sessionid 0x50000484a370021 (org.apache.zookeeper.ClientCnxn)
Apr 16 10:01:11 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 10:01:11,892] INFO Client session timed out, have not heard from server in 6981ms for sessionid 0x50000484a370021, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
Apr 16 10:01:12 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 10:01:12,930] INFO Opening socket connection to server
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 10:01:12,930] INFO Socket connection established to
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 10:01:12,932] WARN Unable to reconnect to ZooKeeper service, session 0x50000484a370021 has expired (org.apache.zookeeper.ClientCnxn)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 10:01:12,932] INFO Unable to reconnect to ZooKeeper service, session 0x50000484a370021 has expired, closing socket connection (org.apache.zookeeper.ClientCnxn)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 10:01:12,932] INFO EventThread shut down for session: 0x50000484a370021 (org.apache.zookeeper.ClientCnxn)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 10:01:12,932] INFO [ZooKeeperClient Kafka server] Session expired. (kafka.zookeeper.ZooKeeperClient)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 10:01:12,934] INFO [ZooKeeperClient Kafka server] Initializing a new session to <1><2><3><4><5>. (kafka.zookeepe
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 10:01:12,934] INFO Initiating client connection, connectString=<1><2><3><4><5> sessionTimeout=6000 watcher=kafka
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 10:01:12,934] INFO Creating /brokers/ids/0 (is it secure? false) (kafka.zk.KafkaZkClient)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 10:01:13,042] INFO Opening socket connection to server <1> Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 10:01:13,042] INFO Socket connection established to <1>, initiating session (org.apache.zookeeper.ClientCnxn)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 10:01:13,832] INFO Session establishment complete on server <1>, sessionid = 0x10000008d93001b, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 10:01:13,874] ERROR [KafkaApi-0] Error when handling request: clientId=22, correlationId=0, api=UPDATE_METADATA, body={controller_id=22,controller_epoch=83,broker_epoch=25770069286,topic_states=[{topic
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: =47,replicas=[0],offline_replicas=[]},{partition=54,controller_epoch=83,leader=-1,leader_epoch=47,isr=[0],zk_version=47,replicas=[0],offline_replicas=[]},{partition=55,controller_epoch=83,leader=-1,leader_epoch=47
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: java.lang.IllegalStateException: Epoch 25770069286 larger than current broker epoch 25770067254
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: at kafka.server.KafkaApis.isBrokerEpochStale(KafkaApis.scala:2612)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: at kafka.server.KafkaApis.handleUpdateMetadataRequest(KafkaApis.scala:242)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: at kafka.server.KafkaApis.handle(KafkaApis.scala:119)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:69)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: at java.lang.Thread.run(Thread.java:748)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 10:01:13,885] INFO Stat of the created znode at /brokers/ids/0 is: 25770069286,25770069286,1587024073851,1587024073851,1,0,0,72057596413149211,200,0,25770069286
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: (kafka.zk.KafkaZkClient)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 10:01:13,885] INFO Registered broker 0 at path /brokers/ids/0 with addresses: ArrayBuffer(EndPoint(192.168.220.120,9092,ListenerName(PLAINTEXT),PLAINTEXT)), czxid (broker epoch): 25770069286 (kafka.zk.
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 10:01:13,886] ERROR [KafkaApi-0] Error when handling request: clientId=22, correlationId=1, api=LEADER_AND_ISR, body={controller_id=22,controller_epoch=83,broker_epoch=25770069286,topic_states=[{topic=
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: poch=47,isr=[0],zk_version=47,replicas=[0],is_new=false},{partition=85,controller_epoch=83,leader=-1,leader_epoch=47,isr=[0],zk_version=47,replicas=[0],is_new=false},{partition=86,controller_epoch=83,leader=-1,lea
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: java.lang.IllegalStateException: Epoch 25770069286 larger than current broker epoch 25770067254
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: at kafka.server.KafkaApis.isBrokerEpochStale(KafkaApis.scala:2612)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: at kafka.server.KafkaApis.handleLeaderAndIsrRequest(KafkaApis.scala:194)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: at kafka.server.KafkaApis.handle(KafkaApis.scala:117)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:69)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: at java.lang.Thread.run(Thread.java:748)
Apr 16 10:01:14 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 10:01:14,867] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions Set(plant_alarms-64, plant_values-146, plant_alarms-8, plant_alarms-0, plant_alarms-80, plant_alarms-128, plant_ala
Apr 16 10:01:14 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 10:01:14,867] INFO [Partition plant_alarms-48 broker=0] plant_alarms-48 starts at Leader Epoch 48 from offset 0. Previous Leader Epoch was: 46 (kafka.cluster.Partition)
Apr 16 10:01:15 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 10:01:15,119] INFO [Partition plant_alarms-10 broker=0] plant_alarms-10 starts at Leader Epoch 48 from offset 0. Previous Leader Epoch was: 46 (kafka.cluster.Partition)
Apr 16 10:01:15 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 10:01:15,356] INFO [Partition plant_values-6 broker=0] plant_values-6 starts at Leader Epoch 48 from offset 76. Previous Leader Epoch was: 46 (kafka.cluster.Partition)
...
...
Apr 16 10:58:15 VM-IWM-APP18 kafka-server-start.sh[32612]: [2020-04-16 10:58:15,130] INFO [Partition demo_plant_values-140 broker=0] demo_plant_values-140 starts at Leader Epoch 98 from offset 30769106. Previous Leader Epoch was: 97 (kafka.cluster.Partition)
Apr 16 10:58:15 VM-IWM-APP18 kafka-server-start.sh[32612]: [2020-04-16 10:58:15,135] INFO [Partition demo_plant_values-102 broker=0] demo_plant_values-102 starts at Leader Epoch 98 from offset 35753967. Previous Leader Epoch was: 97 (kafka.cluster.Partition)
{code}


> IllegalStateException on metadata update when broker learns about its new epoch after the controller
> ----------------------------------------------------------------------------------------------------
>
> Key: KAFKA-9839
> URL: https://issues.apache.org/jira/browse/KAFKA-9839
> Project: Kafka
> Issue Type: Bug
> Components: controller, core
> Affects Versions: 2.3.1
> Reporter: Anna Povzner
> Assignee: Anna Povzner
> Priority: Critical
>
> Broker throws "java.lang.IllegalStateException: Epoch XXX larger than current broker epoch YYY" on UPDATE_METADATA when the controller learns about the broker epoch and sends UPDATE_METADATA before KafkaZkClient.registerBroker completes (i.e., before the broker learns about its new epoch).
> Here is the scenario we observed in more detail:
> 1. ZK session expires on broker 1
> 2. Broker 1 establishes a new session to ZK and creates its znode
> 3. Controller learns about broker 1 and assigns it a new epoch
> 4. Broker 1 receives UPDATE_METADATA from the controller, but it does not know about its new epoch yet, so we get an exception:
> ERROR [KafkaApi-3] Error when handling request: clientId=1, correlationId=0, api=UPDATE_METADATA, body={
> .........
> java.lang.IllegalStateException: Epoch XXX larger than current broker epoch YYY
> at kafka.server.KafkaApis.isBrokerEpochStale(KafkaApis.scala:2725)
> at kafka.server.KafkaApis.handleUpdateMetadataRequest(KafkaApis.scala:320)
> at kafka.server.KafkaApis.handle(KafkaApis.scala:139)
> at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:69)
> at java.lang.Thread.run(Thread.java:748)
> 5. KafkaZkClient.registerBroker completes on broker 1: "INFO Stat of the created znode at /brokers/ids/1"
> The result is that the broker has stale metadata for some time.
> Possible solutions:
> 1. The broker returns a more specific error and the controller retries UPDATE_METADATA
> 2. The broker accepts UPDATE_METADATA with a larger broker epoch.
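
For readers following along in the code, below is a minimal, self-contained sketch of the broker-epoch staleness check named in the stack traces (kafka.server.KafkaApis.isBrokerEpochStale), together with the more lenient variant suggested as possible solution 2. The object and member names (EpochStalenessSketch, cachedBrokerEpoch, UnknownBrokerEpoch) and the overall structure are illustrative assumptions, not the actual Kafka implementation; the epoch values are taken from the log excerpt in the comment above.

{code:scala}
// Minimal sketch of the broker-epoch staleness check discussed in this issue.
// All names here are illustrative placeholders, not the real Kafka identifiers.
object EpochStalenessSketch {

  // Sentinel used when a control request carries no broker epoch.
  val UnknownBrokerEpoch: Long = -1L

  // Epoch the broker cached from its last completed ZK registration.
  // Here it still holds the pre-expiry value seen in the logs above.
  @volatile var cachedBrokerEpoch: Long = 25770067254L

  // Behaviour reported in this ticket: an epoch in the request that is *newer*
  // than the cached one is treated as an illegal state instead of simply
  // "not stale", which is what produces the ERROR lines in the logs.
  def isBrokerEpochStale(epochInRequest: Long): Boolean =
    if (epochInRequest == UnknownBrokerEpoch) false
    else if (epochInRequest < cachedBrokerEpoch) true
    else if (epochInRequest == cachedBrokerEpoch) false
    else throw new IllegalStateException(
      s"Epoch $epochInRequest larger than current broker epoch $cachedBrokerEpoch")

  // Sketch of possible solution 2: accept control requests whose broker epoch
  // is newer than the cached one, since the controller can learn the new epoch
  // from ZooKeeper before registerBroker returns on the broker itself.
  def isBrokerEpochStaleLenient(epochInRequest: Long): Boolean =
    epochInRequest != UnknownBrokerEpoch && epochInRequest < cachedBrokerEpoch

  def main(args: Array[String]): Unit = {
    // New epoch assigned by the controller after the broker re-registered.
    val epochFromController = 25770069286L

    // Lenient check: the request is simply not stale and would be processed.
    println(isBrokerEpochStaleLenient(epochFromController)) // false

    // Current behaviour: reproduces the exception from the logs above.
    try isBrokerEpochStale(epochFromController)
    catch { case e: IllegalStateException => println(e.getMessage) }
  }
}
{code}

Under the lenient variant, the UPDATE_METADATA and LEADER_AND_ISR requests shown in the logs would be applied rather than rejected; the trade-off is that the broker accepts an epoch it has not yet observed through its own registration.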