Steve Rodrigues created KAFKA-10371:
---------------------------------------

             Summary: Partition reassignments can result in crashed ReplicaFetcherThreads.
                 Key: KAFKA-10371
                 URL: https://issues.apache.org/jira/browse/KAFKA-10371
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 2.7.0
            Reporter: Steve Rodrigues


A Kafka cluster performing partition reassignments got stuck with the 
reassignment partially complete, a non-zero number of under-replicated 
partitions (URPs), and increasing max lag.

Looking in the logs, we see: 
{noformat}
[ERROR] 2020-07-31 21:22:23,984 [ReplicaFetcherThread-0-3] 
kafka.server.ReplicaFetcherThread - [ReplicaFetcher replicaId=4, leaderId=3, 
fetcherId=0] Error due to
org.apache.kafka.common.errors.NotLeaderOrFollowerException: Error while 
fetching partition state for foo
[INFO] 2020-07-31 21:22:23,986 [ReplicaFetcherThread-0-3] 
kafka.server.ReplicaFetcherThread - [ReplicaFetcher replicaId=4, leaderId=3, 
fetcherId=0] Stopped
{noformat}

After further investigation, and with some changes to the exception so that it 
generates a stack trace (it previously did not, because it is a client-side 
exception), we see the following on a test run:

{noformat}
[2020-08-06 19:58:21,592] ERROR [ReplicaFetcher replicaId=2, leaderId=1, 
fetcherId=0] Error due to (kafka.server.ReplicaFetcherThread)
org.apache.kafka.common.errors.NotLeaderOrFollowerException: Error while 
fetching partition state for topic-test-topic-85
        at org.apache.kafka.common.protocol.Errors.exception(Errors.java:415)
        at 
kafka.server.ReplicaManager.getPartitionOrException(ReplicaManager.scala:645)
        at 
kafka.server.ReplicaManager.localLogOrException(ReplicaManager.scala:672)
        at 
kafka.server.ReplicaFetcherThread.logStartOffset(ReplicaFetcherThread.scala:133)
        at 
kafka.server.ReplicaFetcherThread.$anonfun$buildFetch$1(ReplicaFetcherThread.scala:316)
        at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553)
        at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:920)
        at 
kafka.server.ReplicaFetcherThread.buildFetch(ReplicaFetcherThread.scala:309)
{noformat}

It appears that the fetcher is attempting to fetch a partition that is being 
reassigned away. From further investigation, it seems that KAFKA-10002 changed 
the order of operations in the StopReplica code from:
1. Remove partition from fetcher
2. Remove partition from partition map
to the reverse order. As a result, the fetcher can now race with StopReplica 
and attempt to build a fetch for a partition that is no longer mapped. In 
particular, localLogOrException is called from logStartOffset, which is 
protected only against KafkaStorageException and not against 
NotLeaderOrFollowerException, so the exception isn't caught and propagates all 
the way out, killing the replica fetcher thread.
We need to switch this back.
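
To make the ordering problem concrete, here is a minimal single-threaded 
sketch in Java. It is a toy model, not Kafka code: the class 
StopReplicaOrderingSketch, its fields, and fetcherSurvives are hypothetical 
names, and IllegalStateException stands in for NotLeaderOrFollowerException. 
It runs only the first StopReplica step and then lets the "fetcher" build a 
fetch, which imitates the fetcher thread being scheduled between the two 
steps.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: none of these names are Kafka's real classes or methods.
class StopReplicaOrderingSketch {
    // Stand-in for the broker's partition map (partition -> log start offset).
    private final Map<String, Long> partitionMap = new HashMap<>();
    // Stand-in for the set of partitions the fetcher thread still tracks.
    private final Set<String> fetcherPartitions = new HashSet<>();

    StopReplicaOrderingSketch(String partition) {
        partitionMap.put(partition, 0L);
        fetcherPartitions.add(partition);
    }

    // Mirrors the failing lookup: throws when the partition is no longer mapped.
    // IllegalStateException stands in for NotLeaderOrFollowerException.
    long logStartOffset(String partition) {
        Long offset = partitionMap.get(partition);
        if (offset == null) {
            throw new IllegalStateException(
                "Error while fetching partition state for " + partition);
        }
        return offset;
    }

    // Visits every partition the fetcher tracks, with no guard around the lookup.
    int buildFetch() {
        int count = 0;
        for (String partition : fetcherPartitions) {
            logStartOffset(partition);
            count++;
        }
        return count;
    }

    // Runs only the first StopReplica step, then lets the fetcher build a fetch,
    // simulating the fetcher thread being scheduled between the two steps.
    static boolean fetcherSurvives(boolean removeFromFetcherFirst) {
        String partition = "topic-test-topic-85";
        StopReplicaOrderingSketch broker = new StopReplicaOrderingSketch(partition);
        if (removeFromFetcherFirst) {
            broker.fetcherPartitions.remove(partition);
        } else {
            broker.partitionMap.remove(partition);
        }
        try {
            broker.buildFetch();
            return true;
        } catch (IllegalStateException e) {
            return false; // in the real thread this would escape and stop it
        }
    }

    public static void main(String[] args) {
        System.out.println("fetcher first: survives = " + fetcherSurvives(true));
        System.out.println("map first:     survives = " + fetcherSurvives(false));
    }
}
```

In this model, removing the partition from the fetcher first leaves nothing 
for buildFetch to look up, so the intermediate state is safe. Removing it from 
the partition map first leaves a window in which the fetcher still tracks a 
partition whose lookup throws.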






