[ 
https://issues.apache.org/jira/browse/KAFKA-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-8571.
------------------------------------
    Resolution: Fixed

> Not complete delayed produce requests when processing StopReplicaRequest 
> causing high produce latency for acks=all
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-8571
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8571
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Zhanxiang (Patrick) Huang
>            Assignee: Zhanxiang (Patrick) Huang
>            Priority: Major
>
> Currently a broker only attempts to complete delayed requests when the 
> high watermark changes or when it receives a LeaderAndIsrRequest. When a 
> broker receives a StopReplicaRequest, it does not try to complete delayed 
> operations, including delayed produce requests for acks=all. The producer 
> can then time out, even though it would have contacted the new leader 
> sooner had a NotLeaderForPartition error been returned.
> This can happen during partition reassignment when the controller is trying 
> to remove the previous leader from the replica set. In this case, the 
> controller sends only a StopReplicaRequest (not a LeaderAndIsrRequest) to the 
> previous leader in the replica set shrink phase. Here is an example:
> {noformat}
> Reassigning the replica set of partition A from [B1, B2] to [B2, B3]:
> t0: Controller expands the replica set to [B1, B2, B3]
> t1: B1 receives produce request PR on partition A with acks=all and timeout 
> T. B1 puts PR into the DelayedProducePurgatory with timeout T.
> t2: Controller elects B2 as the new leader and shrinks the replica set to 
> [B2, B3]. LeaderAndIsrRequests are sent to B2 and B3. StopReplicaRequest is 
> sent to B1.
> t3: B1 receives StopReplicaRequest but doesn't try to complete PR.
> If PR cannot be fulfilled by t3, and t1 + T > t3, PR will eventually time 
> out in the purgatory and the producer will eventually time out the produce 
> request.{noformat}
> Since the leader may receive only a StopReplicaRequest (without any 
> LeaderAndIsrRequest) when it leaves the replica set, a fix for this issue is 
> to also try to complete delayed operations when processing 
> StopReplicaRequest.
>  
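
The proposed fix can be illustrated with a minimal Java sketch. It is not the actual Kafka patch: the class StopReplicaSketch, its Broker and DelayedProduce types, and the methods becomeLeader, produce, checkAndComplete and handleStopReplica are all hypothetical names invented for illustration, the purgatory is reduced to a map from partition name to parked requests, and timeouts are not modelled. The string "NOT_LEADER_FOR_PARTITION" stands in for the NotLeaderForPartition error mentioned in the description.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class StopReplicaSketch {

    /** Hypothetical stand-in for an acks=all produce request parked in the purgatory. */
    public static final class DelayedProduce {
        public final String partition;
        public String error = null; // null while the request is still pending

        public DelayedProduce(String partition) {
            this.partition = partition;
        }

        public boolean isCompleted() {
            return error != null;
        }
    }

    /** Hypothetical, heavily simplified broker that leads some partitions. */
    public static final class Broker {
        private final Set<String> ledPartitions = new HashSet<>();
        private final Map<String, List<DelayedProduce>> purgatory = new HashMap<>();

        public void becomeLeader(String partition) {
            ledPartitions.add(partition);
        }

        /** Parks a produce request; expiry after timeout T is not modelled here. */
        public DelayedProduce produce(String partition) {
            DelayedProduce op = new DelayedProduce(partition);
            purgatory.computeIfAbsent(partition, p -> new ArrayList<>()).add(op);
            return op;
        }

        /** Re-checks parked requests and fails them once this broker is no longer leader. */
        public int checkAndComplete(String partition) {
            if (ledPartitions.contains(partition)) {
                return 0; // still leader: requests keep waiting for replication
            }
            List<DelayedProduce> ops = purgatory.remove(partition);
            if (ops == null) {
                return 0; // nothing was parked for this partition
            }
            for (DelayedProduce op : ops) {
                op.error = "NOT_LEADER_FOR_PARTITION";
            }
            return ops.size();
        }

        /** The idea of the fix: stopping a replica also triggers a completion check. */
        public int handleStopReplica(String partition) {
            ledPartitions.remove(partition);
            return checkAndComplete(partition);
        }
    }

    public static void main(String[] args) {
        Broker b1 = new Broker();
        b1.becomeLeader("A");
        DelayedProduce pr = b1.produce("A"); // t1: PR is parked on B1
        b1.handleStopReplica("A");           // t3: B1 receives StopReplicaRequest
        System.out.println(pr.error);
    }
}
```

In this sketch PR is completed with an error at t3 instead of waiting until t1 + T, so the producer can refresh its metadata and retry against the new leader straight away.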



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
