[ https://issues.apache.org/jira/browse/KAFKA-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17085702#comment-17085702 ]

Sigurd Sandve commented on KAFKA-9839:
--------------------------------------

We experienced the same issue; adding some logs in case they are helpful.

After this happened, the server got stuck in a loop for an hour where the ISR was
not in sync, and it was not resolved until the server was hard-restarted.
{code:java}
Apr 16 10:01:11 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 
10:01:11,527] WARN Client session timed out, have not heard from server in 
6981ms for sessionid 0x50000484a370021 (org.apache.zookeeper.ClientCnxn)
Apr 16 10:01:11 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 
10:01:11,892] INFO Client session timed out, have not heard from server in 
6981ms for sessionid 0x50000484a370021, closing socket connection and 
attempting reconnect (org.apache.zookeeper.ClientCnxn)
Apr 16 10:01:12 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 
10:01:12,930] INFO Opening socket connection to server
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 
10:01:12,930] INFO Socket connection established to 
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 
10:01:12,932] WARN Unable to reconnect to ZooKeeper service, session 
0x50000484a370021 has expired (org.apache.zookeeper.ClientCnxn)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 
10:01:12,932] INFO Unable to reconnect to ZooKeeper service, session 
0x50000484a370021 has expired, closing socket connection 
(org.apache.zookeeper.ClientCnxn)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 
10:01:12,932] INFO EventThread shut down for session: 0x50000484a370021 
(org.apache.zookeeper.ClientCnxn)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 
10:01:12,932] INFO [ZooKeeperClient Kafka server] Session expired. 
(kafka.zookeeper.ZooKeeperClient)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 
10:01:12,934] INFO [ZooKeeperClient Kafka server] Initializing a new session to 
<1><2><3><4><5>. (kafka.zookeepe
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 
10:01:12,934] INFO Initiating client connection, connectString=<1><2><3><4><5> 
sessionTimeout=6000 watcher=kafka
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 
10:01:12,934] INFO Creating /brokers/ids/0 (is it secure? false) 
(kafka.zk.KafkaZkClient)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 
10:01:13,042] INFO Opening socket connection to server <1> Will not attempt to 
authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 
10:01:13,042] INFO Socket connection established to <1>, initiating session 
(org.apache.zookeeper.ClientCnxn)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 
10:01:13,832] INFO Session establishment complete on server <1>, sessionid = 
0x10000008d93001b, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 
10:01:13,874] ERROR [KafkaApi-0] Error when handling request: clientId=22, 
correlationId=0, api=UPDATE_METADATA, 
body={controller_id=22,controller_epoch=83,broker_epoch=25770069286,topic_states=[{topic
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: 
=47,replicas=[0],offline_replicas=[]},{partition=54,controller_epoch=83,leader=-1,leader_epoch=47,isr=[0],zk_version=47,replicas=[0],offline_replicas=[]},{partition=55,controller_epoch=83,leader=-1,leader_epoch=47
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: 
java.lang.IllegalStateException: Epoch 25770069286 larger than current broker 
epoch 25770067254
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]:         at 
kafka.server.KafkaApis.isBrokerEpochStale(KafkaApis.scala:2612)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]:         at 
kafka.server.KafkaApis.handleUpdateMetadataRequest(KafkaApis.scala:242)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]:         at 
kafka.server.KafkaApis.handle(KafkaApis.scala:119)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]:         at 
kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:69)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]:         at 
java.lang.Thread.run(Thread.java:748)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 
10:01:13,885] INFO Stat of the created znode at /brokers/ids/0 is: 
25770069286,25770069286,1587024073851,1587024073851,1,0,0,72057596413149211,200,0,25770069286
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]:  
(kafka.zk.KafkaZkClient)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 
10:01:13,885] INFO Registered broker 0 at path /brokers/ids/0 with addresses: 
ArrayBuffer(EndPoint(192.168.220.120,9092,ListenerName(PLAINTEXT),PLAINTEXT)), 
czxid (broker epoch): 25770069286 (kafka.zk.
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 
10:01:13,886] ERROR [KafkaApi-0] Error when handling request: clientId=22, 
correlationId=1, api=LEADER_AND_ISR, 
body={controller_id=22,controller_epoch=83,broker_epoch=25770069286,topic_states=[{topic=
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: 
poch=47,isr=[0],zk_version=47,replicas=[0],is_new=false},{partition=85,controller_epoch=83,leader=-1,leader_epoch=47,isr=[0],zk_version=47,replicas=[0],is_new=false},{partition=86,controller_epoch=83,leader=-1,lea
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]: 
java.lang.IllegalStateException: Epoch 25770069286 larger than current broker 
epoch 25770067254
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]:         at 
kafka.server.KafkaApis.isBrokerEpochStale(KafkaApis.scala:2612)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]:         at 
kafka.server.KafkaApis.handleLeaderAndIsrRequest(KafkaApis.scala:194)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]:         at 
kafka.server.KafkaApis.handle(KafkaApis.scala:117)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]:         at 
kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:69)
Apr 16 10:01:13 VM-IWM-APP18 kafka-server-start.sh[9200]:         at 
java.lang.Thread.run(Thread.java:748)
Apr 16 10:01:14 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 
10:01:14,867] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for 
partitions Set(plant_alarms-64, plant_values-146, plant_alarms-8, 
plant_alarms-0, plant_alarms-80, plant_alarms-128, plant_ala
Apr 16 10:01:14 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 
10:01:14,867] INFO [Partition plant_alarms-48 broker=0] plant_alarms-48 starts 
at Leader Epoch 48 from offset 0. Previous Leader Epoch was: 46 
(kafka.cluster.Partition)
Apr 16 10:01:15 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 
10:01:15,119] INFO [Partition plant_alarms-10 broker=0] plant_alarms-10 starts 
at Leader Epoch 48 from offset 0. Previous Leader Epoch was: 46 
(kafka.cluster.Partition)
Apr 16 10:01:15 VM-IWM-APP18 kafka-server-start.sh[9200]: [2020-04-16 
10:01:15,356] INFO [Partition plant_values-6 broker=0] plant_values-6 starts at 
Leader Epoch 48 from offset 76. Previous Leader Epoch was: 46 
(kafka.cluster.Partition)
...
...
Apr 16 10:58:15 VM-IWM-APP18 kafka-server-start.sh[32612]: [2020-04-16 
10:58:15,130] INFO [Partition demo_plant_values-140 broker=0] 
demo_plant_values-140 starts at Leader Epoch 98 from offset 30769106. Previous 
Leader Epoch was: 97 (kafka.cluster.Partition)
Apr 16 10:58:15 VM-IWM-APP18 kafka-server-start.sh[32612]: [2020-04-16 
10:58:15,135] INFO [Partition demo_plant_values-102 broker=0] 
demo_plant_values-102 starts at Leader Epoch 98 from offset 35753967. Previous 
Leader Epoch was: 97 (kafka.cluster.Partition)


{code}
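
For reference, here is a minimal standalone sketch (not the actual Kafka source; the object name and the hard-coded epochs, taken from the log above, are only illustrative) of the kind of check behind KafkaApis.isBrokerEpochStale that produces this exception: a control request carrying an epoch newer than the broker's own cached epoch is treated as an illegal state rather than accepted.
{code:scala}
// Standalone sketch, NOT the actual Kafka source: illustrates the broker-epoch
// staleness check seen in the stack traces above. In the real broker the current
// epoch comes from the znode created under /brokers/ids/<id> during registration.
object BrokerEpochCheckSketch {

  // Epoch this broker last learned for itself; only updated once its own
  // registerBroker call completes.
  @volatile var cachedBrokerEpoch: Long = 25770067254L

  // A request carrying an older epoch is stale and can be rejected gracefully,
  // but an epoch newer than the cached one is treated as an illegal state,
  // which is exactly the race hit in the logs above.
  def isBrokerEpochStale(epochInRequest: Long): Boolean = {
    val current = cachedBrokerEpoch
    if (epochInRequest < current) true
    else if (epochInRequest == current) false
    else throw new IllegalStateException(
      s"Epoch $epochInRequest larger than current broker epoch $current")
  }

  def main(args: Array[String]): Unit = {
    // The controller had already seen the new registration (epoch 25770069286)
    // and sent UPDATE_METADATA/LEADER_AND_ISR before registerBroker returned here.
    isBrokerEpochStale(25770069286L) // throws, matching the log above
  }
}
{code}
In our case the broker still cached epoch 25770067254 while the incoming requests already carried 25770069286, so every UPDATE_METADATA and LEADER_AND_ISR request hit this branch until the broker's own registration at /brokers/ids/0 completed.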
 

> IllegalStateException on metadata update when broker learns about its new 
> epoch after the controller
> ----------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-9839
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9839
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller, core
>    Affects Versions: 2.3.1
>            Reporter: Anna Povzner
>            Assignee: Anna Povzner
>            Priority: Critical
>
> Broker throws "java.lang.IllegalStateException: Epoch XXX larger than current 
> broker epoch YYY" on UPDATE_METADATA when the controller learns about the 
> broker's new epoch and sends UPDATE_METADATA before KafkaZkClient.registerBroker 
> completes, i.e. before the broker learns about its new epoch.
> Here is the scenario we observed in more detail:
> 1. ZK session expires on broker 1
> 2. Broker 1 establishes new session to ZK and creates znode
> 3. Controller learns about broker 1 and assigns epoch
> 4. Broker 1 receives UPDATE_METADATA from controller, but it does not know 
> about its new epoch yet, so we get an exception:
> ERROR [KafkaApi-3] Error when handling request: clientId=1, correlationId=0, 
> api=UPDATE_METADATA, body={
> .........
> java.lang.IllegalStateException: Epoch XXX larger than current broker epoch YYY
>         at kafka.server.KafkaApis.isBrokerEpochStale(KafkaApis.scala:2725)
>         at kafka.server.KafkaApis.handleUpdateMetadataRequest(KafkaApis.scala:320)
>         at kafka.server.KafkaApis.handle(KafkaApis.scala:139)
>         at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:69)
>         at java.lang.Thread.run(Thread.java:748)
> 5. KafkaZkClient.registerBroker completes on broker 1: "INFO Stat of the 
> created znode at /brokers/ids/1"
> The result is that the broker has stale metadata for some time.
> Possible solutions:
> 1. Broker returns a more specific error and the controller retries UPDATE_METADATA.
> 2. Broker accepts UPDATE_METADATA with a larger broker epoch (see the sketch below).
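
A rough sketch of possible solution 2 above (illustrative only, not actual Kafka code; the object and method names are hypothetical): treat only a strictly older epoch in the request as stale, so an epoch the controller learned from a newer, already-successful registration is accepted instead of triggering the exception.
{code:scala}
// Hypothetical sketch of possible solution 2; not the actual Kafka code.
// Only a strictly older broker epoch in the control request is considered stale.
// An equal or newer epoch is accepted, since the controller can only have learned
// a newer epoch from a registration that already succeeded in ZooKeeper.
object LenientEpochCheckSketch {

  def isBrokerEpochStale(cachedBrokerEpoch: Long, epochInRequest: Long): Boolean =
    epochInRequest < cachedBrokerEpoch

  def main(args: Array[String]): Unit = {
    // Epochs from the logs in the comment above: the request carries the newer epoch.
    println(isBrokerEpochStale(25770067254L, 25770069286L)) // false: request accepted
  }
}
{code}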


