[ https://issues.apache.org/jira/browse/KAFKA-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Gustafson updated KAFKA-10371:
------------------------------------
    Affects Version/s:     (was: 2.7.0)

> Partition reassignments can result in crashed ReplicaFetcherThreads.
> --------------------------------------------------------------------
>
>                 Key: KAFKA-10371
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10371
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>            Reporter: Steve Rodrigues
>            Assignee: David Jacot
>            Priority: Critical
>
> A Kafka system doing partition reassignments got stuck with the reassignment
> partially done and the system with a non-zero number of URPs and increasing
> max lag.
>
> Looking in the logs, we see:
> {noformat}
> [ERROR] 2020-07-31 21:22:23,984 [ReplicaFetcherThread-0-3] kafka.server.ReplicaFetcherThread - [ReplicaFetcher replicaId=4, leaderId=3, fetcherId=0] Error due to
> org.apache.kafka.common.errors.NotLeaderOrFollowerException: Error while fetching partition state for foo
> [INFO] 2020-07-31 21:22:23,986 [ReplicaFetcherThread-0-3] kafka.server.ReplicaFetcherThread - [ReplicaFetcher replicaId=4, leaderId=3, fetcherId=0] Stopped
> {noformat}
>
> Investigating further and with some helpful changes to the exception (which
> was not generating a stack trace because it was a client-side exception), we
> see on a test run:
> {noformat}
> [2020-08-06 19:58:21,592] ERROR [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Error due to (kafka.server.ReplicaFetcherThread)
> org.apache.kafka.common.errors.NotLeaderOrFollowerException: Error while fetching partition state for topic-test-topic-85
>         at org.apache.kafka.common.protocol.Errors.exception(Errors.java:415)
>         at kafka.server.ReplicaManager.getPartitionOrException(ReplicaManager.scala:645)
>         at kafka.server.ReplicaManager.localLogOrException(ReplicaManager.scala:672)
>         at kafka.server.ReplicaFetcherThread.logStartOffset(ReplicaFetcherThread.scala:133)
>         at kafka.server.ReplicaFetcherThread.$anonfun$buildFetch$1(ReplicaFetcherThread.scala:316)
>         at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553)
>         at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551)
>         at scala.collection.AbstractIterable.foreach(Iterable.scala:920)
>         at kafka.server.ReplicaFetcherThread.buildFetch(ReplicaFetcherThread.scala:309)
> {noformat}
>
> It appears that the fetcher is attempting to fetch for a partition that has
> been getting reassigned away. From further investigation, it seems that in
> KAFKA-10002 the StopReplica code was changed from:
> 1. Remove partition from fetcher
> 2. Remove partition from partition map
> to the other way around, but now the fetcher may race and attempt to build a
> fetch for a partition that's no longer mapped. In particular, since the
> logOrException code is being called from logStartOffset which isn't protected
> against NotLeaderOrFollowerException, just against KafkaStorageException, the
> exception isn't caught and throws all the way out, killing the replica
> fetcher thread.
>
> We need to switch this back.
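
To illustrate why the ordering matters, here is a minimal single-threaded sketch of the race described above. It is not Kafka code: the class `StopReplicaOrderingSketch`, its fields, and `fetcherSurvives` are hypothetical names, and an `IllegalStateException` stands in for the `NotLeaderOrFollowerException` seen in the logs. The sketch assumes the fetcher gets to build one fetch in the window between the two StopReplica steps.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the race; none of these names are Kafka's real classes.
class StopReplicaOrderingSketch {
    // Stand-in for the broker's partition map (partition name -> log start offset).
    private final Map<String, Long> partitionMap = new HashMap<>();
    // Stand-in for the set of partitions the fetcher thread is currently fetching.
    private final Set<String> fetcherPartitions = new LinkedHashSet<>();

    StopReplicaOrderingSketch(String partition) {
        partitionMap.put(partition, 0L);
        fetcherPartitions.add(partition);
    }

    // Mirrors the shape of buildFetch: looks up every fetched partition and throws
    // if one is no longer mapped (standing in for NotLeaderOrFollowerException).
    void buildFetch() {
        for (String p : fetcherPartitions) {
            if (!partitionMap.containsKey(p)) {
                throw new IllegalStateException("Error while fetching partition state for " + p);
            }
        }
    }

    // Runs the two StopReplica steps in the given order, with one fetch attempt
    // interleaved between them, and reports whether the fetcher survived.
    static boolean fetcherSurvives(boolean removeFromFetcherFirst) {
        String partition = "topic-test-topic-85";
        StopReplicaOrderingSketch broker = new StopReplicaOrderingSketch(partition);
        try {
            if (removeFromFetcherFirst) {
                broker.fetcherPartitions.remove(partition); // 1. remove from fetcher
                broker.buildFetch();                        // fetcher runs in the gap
                broker.partitionMap.remove(partition);      // 2. remove from partition map
            } else {
                broker.partitionMap.remove(partition);      // 1. remove from partition map
                broker.buildFetch();                        // fetcher still holds the partition
                broker.fetcherPartitions.remove(partition); // 2. remove from fetcher
            }
            return true;
        } catch (IllegalStateException e) {
            // In the real broker the uncaught exception stops the fetcher thread.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("fetcher first: survives = " + fetcherSurvives(true));
        System.out.println("map first:     survives = " + fetcherSurvives(false));
    }
}
```

In this model, removing the partition from the fetcher first leaves nothing for the fetch to look up, so the interleaved fetch is harmless. Removing it from the partition map first leaves the fetcher holding a partition it can no longer resolve.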