Wenbing Shen created KAFKA-12157:
------------------------------------

             Summary: Upgrade test from 2.0.0 to 2.7.0 hits a problem
                 Key: KAFKA-12157
                 URL: https://issues.apache.org/jira/browse/KAFKA-12157
             Project: Kafka
          Issue Type: Bug
          Components: log
    Affects Versions: 2.7.0
            Reporter: Wenbing Shen
         Attachments: 1001server.log, 1001serverlog.png, 1003server.log, 1003serverlog.png, 1003statechange.log
I was running a rolling upgrade from version 2.0.0 to version 2.7.0 in a test environment and hit the following problem. When the rolling upgrade reached the second round, I stopped the first broker (1001) of that round and the errors below appeared. While a broker was processing client produce requests, the start offset of the partition leader's leader epoch suddenly became 0, and every subsequent write to the same partition then logged an error. All partition leaderships with a replica on 1001 were transferred to broker 1003, and any of those partitions on 1003 that received produce requests hit this error. When I restarted 1001, it reported the following error:

[2021-01-06 16:46:55,955] ERROR (ReplicaFetcherThread-8-1003 kafka.server.ReplicaFetcherThread 76) [ReplicaFetcher replicaId=1001, leaderId=1003, fetcherId=8] Unexpected error occurred while processing data for partition test-perf1-9 at offset 9666953

I used the following command to generate the produce load:

nohup /home/kafka/software/kafka/bin/kafka-producer-perf-test.sh --num-records 1000000000000 --record-size 1000 --throughput 30000 --producer-props bootstrap.servers=hdp1:9092,hdp2:9092,hdp3:9092 acks=1 --topic test-perf1 > 1pro.log 2>&1 &

I tried to reproduce the problem, but after three attempts it has not reappeared. I am curious how this happened and why broker 1003 reset the startOffset of leaderEpoch 4 to 0, given that offsets are assigned by the broker in Log.append (a rough sketch of that code path follows the server.log excerpt below).

broker 1003: server.log

[2021-01-06 16:37:59,492] WARN (data-plane-kafka-request-handler-131 kafka.server.epoch.LeaderEpochFileCache 70) [LeaderEpochCache test-perf1-9] New epoch entry EpochEntry(epoch=4, startOffset=0) caused truncation of conflicting entries ListBuffer(EpochEntry(epoch=4, startOffset=9667122), EpochEntry(epoch=3, startOffset=9195729), EpochEntry(epoch=2, startOffset=8348201)). Cache now contains 0 entries.
[2021-01-06 16:37:59,493] WARN (data-plane-kafka-request-handler-131 kafka.server.epoch.LeaderEpochFileCache 70) [LeaderEpochCache test-perf1-8] New epoch entry EpochEntry(epoch=3, startOffset=0) caused truncation of conflicting entries ListBuffer(EpochEntry(epoch=3, startOffset=9667478), EpochEntry(epoch=2, startOffset=9196127), EpochEntry(epoch=1, startOffset=8342787)). Cache now contains 0 entries.
[2021-01-06 16:37:59,495] WARN (data-plane-kafka-request-handler-131 kafka.server.epoch.LeaderEpochFileCache 70) [LeaderEpochCache test-perf1-2] New epoch entry EpochEntry(epoch=3, startOffset=0) caused truncation of conflicting entries ListBuffer(EpochEntry(epoch=3, startOffset=9667478), EpochEntry(epoch=2, startOffset=9196127), EpochEntry(epoch=1, startOffset=8336727)). Cache now contains 0 entries.
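For context on where these messages come from: judging by the class and method names in the WARN lines above and in the stack trace below, the leader append path Log.append -> Log.maybeAssignEpochStartOffset -> LeaderEpochFileCache.assign records an EpochEntry(epoch, startOffset) per batch, drops cached entries that conflict with the new one, and rejects entries with a negative epoch or start offset. The snippet below is only a rough, self-contained sketch I put together to reason about the symptoms; it is not the Kafka 2.7.0 source, and the class name, fields and exact truncation predicate are assumptions.

import scala.collection.mutable

// Rough sketch of the leader-epoch bookkeeping on the produce path, inferred
// from the WARN/ERROR messages in this report. NOT the actual Kafka source;
// names and the exact truncation rule are assumptions.
case class EpochEntry(epoch: Int, startOffset: Long)

class LeaderEpochCacheSketch {
  // cached entries, kept in ascending epoch order
  private val epochs = mutable.ArrayBuffer.empty[EpochEntry]

  // Leader append path: called once per appended batch with the batch's
  // partitionLeaderEpoch and its broker-assigned base offset.
  def assign(epoch: Int, startOffset: Long): Unit = {
    val entry = EpochEntry(epoch, startOffset)

    // The kind of check that would produce
    // "Received invalid partition leader epoch entry EpochEntry(epoch=4, startOffset=-3)"
    if (entry.epoch < 0 || entry.startOffset < 0)
      throw new IllegalArgumentException(s"Received invalid partition leader epoch entry $entry")

    // Entries at the tail whose epoch/startOffset are not strictly below the new
    // entry conflict with it and are removed. With a bogus startOffset of 0 this
    // wipes every existing entry, matching the
    // "caused truncation of conflicting entries ... Cache now contains 0 entries" WARNs.
    val conflicting = epochs.reverseIterator
      .takeWhile(e => e.epoch >= entry.epoch || e.startOffset >= entry.startOffset)
      .toList
    if (conflicting.nonEmpty) {
      epochs.remove(epochs.size - conflicting.size, conflicting.size)
      println(s"New epoch entry $entry caused truncation of conflicting entries $conflicting. " +
        s"Cache now contains ${epochs.size} entries.")
    }
    epochs += entry
  }
}

With a cache holding epochs 2-4 for test-perf1-9, this sketch reproduces both symptoms in order: assign(4, 0) wipes the cache and prints the truncation warning, and a later assign(4, -3) throws the IllegalArgumentException shown in the stack trace below.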
[2021-01-06 16:37:59,498] ERROR (data-plane-kafka-request-handler-142 kafka.server.ReplicaManager 76) [ReplicaManager broker=1003] Error processing append operation on partition test-perf1-9
java.lang.IllegalArgumentException: Received invalid partition leader epoch entry EpochEntry(epoch=4, startOffset=-3)
        at kafka.server.epoch.LeaderEpochFileCache.assign(LeaderEpochFileCache.scala:67)
        at kafka.server.epoch.LeaderEpochFileCache.assign(LeaderEpochFileCache.scala:59)
        at kafka.log.Log.maybeAssignEpochStartOffset(Log.scala:1268)
        at kafka.log.Log.$anonfun$append$6(Log.scala:1181)
        at kafka.log.Log$$Lambda$935/184936331.accept(Unknown Source)
        at java.lang.Iterable.forEach(Iterable.java:75)
        at kafka.log.Log.$anonfun$append$2(Log.scala:1179)
        at kafka.log.Log.append(Log.scala:2387)
        at kafka.log.Log.appendAsLeader(Log.scala:1050)
        at kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1079)
        at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1067)
        at kafka.server.ReplicaManager.$anonfun$appendToLocalLog$4(ReplicaManager.scala:953)
        at kafka.server.ReplicaManager$$Lambda$1025/1369541490.apply(Unknown Source)
        at scala.collection.StrictOptimizedMapOps.map(StrictOptimizedMapOps.scala:28)
        at scala.collection.StrictOptimizedMapOps.map$(StrictOptimizedMapOps.scala:27)
        at scala.collection.mutable.HashMap.map(HashMap.scala:35)
        at kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:941)
        at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:621)
        at kafka.server.KafkaApis.handleProduceRequest(KafkaApis.scala:625)

broker 1001: server.log

[2021-01-06 16:46:55,955] ERROR (ReplicaFetcherThread-8-1003 kafka.server.ReplicaFetcherThread 76) [ReplicaFetcher replicaId=1001, leaderId=1003, fetcherId=8] Unexpected error occurred while processing data for partition test-perf1-9 at offset 9666953
kafka.common.OffsetsOutOfOrderException: Out of order offsets found in append to test-perf1-9: ArrayBuffer(9666953, 9666954, 9666955, 9666956, 9666957, 9666958, 9666959, 9666960, 9666961, 9666962, 9666963, 9666964, 9666965, 9666966, 9666967, 9666968, 9666969, 9666970, 9666971, 9666972, 9666973, 9666974, 9666975, 9666976, 9666977, 9666978, 9666979, 9666980, 9666981, 9666982, 9666983, 9666984, 9666985, 9666986, 9666987, 9666988, 9666989, 9666990, 9666991, 9666992, 9666993, 9666994, 9666995, 9666996, 9666997, 9666998, 9666999, 9667000, 9667001, 9667002, 9667003, 9667004, 9667005, 9667006, 9667007, 9667008, 9667009, 9667010, 9667011, 9667012, 9667013, 9667014, 9667015, 9667016, 9667017, 9667018, 9667019, 9667020, 9667021, 9667022, 9667023, 9667024, 9667025, 9667026, 9667027, 9667028, 9667029, 9667030, 9667031, 9667032, 9667033, 9667034, 9667035, 9667036, 9667037, 9667038, 9667039, 9667040, 9667041, 9667042, 9667043, 9667044, 9667045, 9667046, 9667047, 9667048, 9667049, 9667050, 9667051, 9667052, 9667053, 9667054, 9667055, 9667056, 9667057, 9667058, 9667059, 9667060, 9667061, 9667062, 9667063, 9667064, 9667065, 9667066, 9667067, 9667068, 9667069, 9667070, 9667071, 9667072, 9667073, 9667074, 9667075, 9667076, 9667077, 9667078...

--
This message was sent by Atlassian Jira
(v8.3.4#803005)