[ https://issues.apache.org/jira/browse/KAFKA-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Gustafson resolved KAFKA-8571.
------------------------------------
    Resolution: Fixed

> Not complete delayed produce requests when processing StopReplicaRequest causing high produce latency for acks=all
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-8571
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8571
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Zhanxiang (Patrick) Huang
>            Assignee: Zhanxiang (Patrick) Huang
>            Priority: Major
>
> Currently a broker only attempts to complete delayed requests when the
> high watermark changes or when it receives a LeaderAndIsrRequest. When a
> broker receives a StopReplicaRequest, it does not try to complete delayed
> operations, including delayed produce requests for acks=all. This can cause
> the producer to time out, even though it would have retried against the new
> leader sooner had a NotLeaderForPartition error been returned.
> This can happen during partition reassignment when the controller is trying
> to remove the previous leader from the replica set. In this case, the
> controller sends only a StopReplicaRequest (not a LeaderAndIsrRequest) to the
> previous leader in the replica set shrink phase. Here is an example:
> {noformat}
> Reassigning the replica set of partition A from [B1, B2] to [B2, B3]:
> t0: Controller expands the replica set to [B1, B2, B3].
> t1: B1 receives produce request PR on partition A with acks=all and timeout
> T. B1 puts PR into the DelayedProducePurgatory with timeout T.
> t2: Controller elects B2 as the new leader and shrinks the replica set to
> [B2, B3]. LeaderAndIsrRequests are sent to B2 and B3. StopReplicaRequest is
> sent to B1.
> t3: B1 receives StopReplicaRequest but doesn't try to complete PR.
> If PR cannot be fulfilled by t3, and t1 + T > t3, PR will eventually time
> out in the purgatory and the producer will eventually time out the produce
> request.{noformat}
> Since it is possible for the leader to receive only a StopReplicaRequest
> (without receiving any LeaderAndIsrRequest) when it leaves the replica set, a
> fix for this issue is to also try to complete delayed operations when
> processing a StopReplicaRequest.
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
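
The timeline in the issue description can be illustrated with a small toy model. This is only a sketch and not Kafka's broker code: ToyBroker, DelayedProduce, try_complete, and the complete_delayed flag are hypothetical names invented for the illustration, and the times (t1 = 1, t3 = 3, T = 30) are arbitrary. It assumes a delayed produce request can be answered with NOT_LEADER_FOR_PARTITION as soon as the broker knows it no longer leads the partition.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DelayedProduce:
    """A pending acks=all produce request parked in the purgatory."""
    partition: str
    deadline: float                      # t1 + T
    result: Optional[str] = None         # set once the request is answered
    completed_at: Optional[float] = None


@dataclass
class ToyBroker:
    """Hypothetical stand-in for broker B1; not Kafka's actual classes."""
    leader_for: set = field(default_factory=set)
    purgatory: Dict[str, List[DelayedProduce]] = field(default_factory=dict)

    def produce(self, partition: str, now: float, timeout: float) -> DelayedProduce:
        # t1: the request cannot be satisfied yet, so it is parked.
        op = DelayedProduce(partition, deadline=now + timeout)
        self.purgatory.setdefault(partition, []).append(op)
        return op

    def try_complete(self, partition: str, now: float) -> None:
        # Answer parked requests if this broker no longer leads the partition.
        if partition in self.leader_for:
            return
        for op in self.purgatory.pop(partition, []):
            op.result, op.completed_at = "NOT_LEADER_FOR_PARTITION", now

    def handle_stop_replica(self, partition: str, now: float,
                            complete_delayed: bool) -> None:
        # t3: the broker stops being a replica for the partition.
        self.leader_for.discard(partition)
        if complete_delayed:             # the behaviour proposed in the issue
            self.try_complete(partition, now)

    def expire(self, now: float) -> None:
        # Requests still parked at their deadline time out.
        for ops in self.purgatory.values():
            for op in ops:
                if op.result is None and now >= op.deadline:
                    op.result, op.completed_at = "REQUEST_TIMED_OUT", op.deadline
            ops[:] = [op for op in ops if op.result is None]


def simulate(complete_delayed: bool) -> DelayedProduce:
    b1 = ToyBroker(leader_for={"A"})
    pr = b1.produce("A", now=1.0, timeout=30.0)                           # t1
    b1.handle_stop_replica("A", now=3.0, complete_delayed=complete_delayed)  # t3
    b1.expire(now=31.0)                                                   # t1 + T
    return pr


if __name__ == "__main__":
    for flag in (False, True):
        pr = simulate(flag)
        print(f"complete_delayed={flag}: {pr.result} at t={pr.completed_at}")
```

In this model, with complete_delayed=False the request sits in the purgatory until t1 + T, while with complete_delayed=True it is answered at t3, so a producer could refresh its metadata and retry against the new leader much earlier.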