[ 
https://issues.apache.org/jira/browse/KAFKA-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-10371:
------------------------------------
    Affects Version/s:     (was: 2.7.0)

> Partition reassignments can result in crashed ReplicaFetcherThreads.
> --------------------------------------------------------------------
>
>                 Key: KAFKA-10371
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10371
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>            Reporter: Steve Rodrigues
>            Assignee: David Jacot
>            Priority: Critical
>
> A Kafka cluster performing partition reassignments got stuck with the 
> reassignment partially done, a non-zero number of under-replicated 
> partitions (URPs), and increasing max lag.
> Looking in the logs, we see:
> {noformat}
> [ERROR] 2020-07-31 21:22:23,984 [ReplicaFetcherThread-0-3] kafka.server.ReplicaFetcherThread - [ReplicaFetcher replicaId=4, leaderId=3, fetcherId=0] Error due to
> org.apache.kafka.common.errors.NotLeaderOrFollowerException: Error while fetching partition state for foo
> [INFO] 2020-07-31 21:22:23,986 [ReplicaFetcherThread-0-3] kafka.server.ReplicaFetcherThread - [ReplicaFetcher replicaId=4, leaderId=3, fetcherId=0] Stopped
> {noformat}
> Investigating further, after changing the exception to generate a stack 
> trace (as a client-side exception it was not producing one), we see on a 
> test run:
> {noformat}
> [2020-08-06 19:58:21,592] ERROR [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Error due to (kafka.server.ReplicaFetcherThread)
> org.apache.kafka.common.errors.NotLeaderOrFollowerException: Error while fetching partition state for topic-test-topic-85
>         at org.apache.kafka.common.protocol.Errors.exception(Errors.java:415)
>         at kafka.server.ReplicaManager.getPartitionOrException(ReplicaManager.scala:645)
>         at kafka.server.ReplicaManager.localLogOrException(ReplicaManager.scala:672)
>         at kafka.server.ReplicaFetcherThread.logStartOffset(ReplicaFetcherThread.scala:133)
>         at kafka.server.ReplicaFetcherThread.$anonfun$buildFetch$1(ReplicaFetcherThread.scala:316)
>         at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553)
>         at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551)
>         at scala.collection.AbstractIterable.foreach(Iterable.scala:920)
>         at kafka.server.ReplicaFetcherThread.buildFetch(ReplicaFetcherThread.scala:309)
> {noformat}
> It appears that the fetcher is attempting to fetch for a partition that is 
> being reassigned away. From further investigation, it seems that in 
> KAFKA-10002 the StopReplica code was changed from:
> 1. Remove partition from fetcher
> 2. Remove partition from partition map
> to the reverse order, so the fetcher can now race and attempt to build a 
> fetch for a partition that is no longer mapped. In particular, 
> localLogOrException is called from logStartOffset, which is protected only 
> against KafkaStorageException and not against NotLeaderOrFollowerException, 
> so the exception is not caught and propagates all the way out, killing the 
> replica fetcher thread.
> We need to restore the original order.
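
The ordering problem described above can be illustrated with a small, deterministic Java sketch. It is not Kafka code: the class `StopReplicaOrdering`, its fields, and the `NotLeaderOrFollower` exception are hypothetical stand-ins that only borrow names from the stack trace. To keep it reproducible, the sketch does not use real threads. It simply runs one `buildFetch` pass at the point between the two removal steps, which is where the race would land.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the broker state involved in the race; not Kafka's classes.
class StopReplicaOrdering {
    // Stand-in for org.apache.kafka.common.errors.NotLeaderOrFollowerException.
    static class NotLeaderOrFollower extends RuntimeException {
        NotLeaderOrFollower(String message) { super(message); }
    }

    // Partitions the broker currently hosts (the "partition map").
    final Set<String> partitionMap = new HashSet<>();
    // Partitions the fetcher thread still builds fetch requests for.
    final Set<String> fetcherPartitions = new HashSet<>();

    void addReplica(String partition) {
        partitionMap.add(partition);
        fetcherPartitions.add(partition);
    }

    // Mirrors the failing lookup: throws if the partition is no longer mapped.
    long logStartOffset(String partition) {
        if (!partitionMap.contains(partition)) {
            throw new NotLeaderOrFollower(
                "Error while fetching partition state for " + partition);
        }
        return 0L;
    }

    // One fetcher pass: looks up every partition the fetcher still owns.
    int buildFetch() {
        int built = 0;
        for (String partition : fetcherPartitions) {
            logStartOffset(partition);
            built++;
        }
        return built;
    }

    // Performs only the FIRST removal step, then lets the fetcher run once.
    // Returns true if the fetcher pass completes without an exception.
    static boolean fetcherSurvives(boolean removeFromFetcherFirst) {
        StopReplicaOrdering broker = new StopReplicaOrdering();
        String partition = "topic-test-topic-85";
        broker.addReplica(partition);
        if (removeFromFetcherFirst) {
            broker.fetcherPartitions.remove(partition); // original order, step 1
        } else {
            broker.partitionMap.remove(partition);      // reversed order, step 1
        }
        try {
            broker.buildFetch();
            return true;
        } catch (NotLeaderOrFollower e) {
            return false; // in the real broker this would kill the fetcher thread
        }
    }

    public static void main(String[] args) {
        System.out.println("fetcher first: survives=" + fetcherSurvives(true));
        System.out.println("map first:     survives=" + fetcherSurvives(false));
    }
}
```

In this model, removing the partition from the fetcher first means the fetcher never looks up an unmapped partition, while removing it from the partition map first leaves a window in which the lookup throws.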



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
