[ https://issues.apache.org/jira/browse/KAFKA-5678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113064#comment-16113064 ]
Jiangjie Qin commented on KAFKA-5678:
-------------------------------------

[~cuiyang]

1. Currently the request timeout is used in two places:
A. The actual request timeout on the wire. In this case the producer will retry.
B. When a batch has been sitting in the accumulator for longer than the request timeout and the producer cannot make progress, the batch is expired; this is not retriable.

In the original design, in order to make progress, the producer needs to know the leader of a partition, and that information needs to be up to date. The current implementation of this is a little buggy: it checks whether there is an in-flight batch for a partition or not, but when max.in.flight.requests is set to 1 and a metadata refresh happens, this check may fail and expire the batch by mistake. Expiration on the producer side is a little trickier than it looks; KIP-91 is trying to address that. It looks like what you saw was the second case, so setting a higher request timeout is the way to go (see the configuration sketch after this message).

2. The reason this problem happens during controlled shutdown is that during controlled shutdown the LeaderAndIsrRequests are not batched, whereas in other leader-movement scenarios the LeaderAndIsrRequests are batched, so in those cases this should not happen.

> When a broker graceful shutdown occurs, producer-side sends time out.
> --------------------------------------------------------------------------
>
>                 Key: KAFKA-5678
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5678
>             Project: Kafka
>          Issue Type: Improvement
>    Affects Versions: 0.9.0.0, 0.10.0.0, 0.11.0.0
>            Reporter: tuyang
>
> Test environment is as follows:
> 1. Kafka version: 0.9.0.1
> 2. Cluster with 3 brokers, with broker ids A, B, C
> 3. Topic with 6 partitions and 2 replicas, with 2 leader partitions on each broker.
> We can reproduce the problem as follows:
> 1. We send messages as quickly as possible with acks=-1.
> 2. If partition p0's leader is on broker A and we gracefully shut down broker A, but we send a message to p0 before the new leader is elected, the message can be appended to the leader replica successfully; but if the follower replica does not catch up quickly enough, the shutting-down broker creates a DelayedProduce for this request and waits for it to complete until request.timeout.ms.
> 3. Because of the ControlledShutdown request from broker A, the p0 partition leader is re-elected, and the replica on broker A becomes a follower before the broker completes its shutdown. The DelayedProduce is then not triggered to complete until it expires.
> 4. If broker A's shutdown takes too long, the producer only gets the response after request.timeout.ms, which increases producer send latency while we are restarting brokers one by one.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
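For case 1.B above, the suggested workaround is to raise request.timeout.ms on the producer so batches are not expired in the accumulator while a controlled shutdown and leader re-election are in progress. Below is a minimal sketch of a producer configured along those lines; the broker addresses, topic name, and the 60000 ms timeout are illustrative assumptions, not values taken from this issue.

{code:java}
import java.util.Properties;
import java.util.concurrent.Future;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerTimeoutSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder broker list for this sketch.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "brokerA:9092,brokerB:9092,brokerC:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // acks=-1 (all): the leader waits for the ISR, which is what creates the
        // DelayedProduce on the shutting-down broker described in the report.
        props.put(ProducerConfig.ACKS_CONFIG, "-1");
        // Raise the request timeout so batches are not expired in the accumulator
        // during a controlled shutdown; 60000 ms is an illustrative value only.
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 60000);
        props.put(ProducerConfig.RETRIES_CONFIG, 3);
        // The buggy expiration check mentioned in the comment involves this being 1.
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            long start = System.currentTimeMillis();
            Future<RecordMetadata> future =
                    producer.send(new ProducerRecord<>("test-topic", "key", "value"));
            RecordMetadata md = future.get();
            // During a rolling broker restart this latency can approach request.timeout.ms.
            System.out.printf("p%d offset %d, latency %d ms%n",
                    md.partition(), md.offset(), System.currentTimeMillis() - start);
        }
    }
}
{code}

This only mitigates the producer-side symptom; the underlying expiration behavior is what KIP-91 aims to address.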