[ https://issues.apache.org/jira/browse/KAFKA-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884094#comment-15884094 ]
James Cheng commented on KAFKA-1342:
------------------------------------

[~jjkoshy], [~toddpalino], is it still true that it is unsafe to increase the number of controlled shutdown retries (controlled.shutdown.max.retries)?

We currently have brokers with 10,000 partitions each, and there is no way they can shut down within the 30-second shutdown timeout, even with the current default of controlled.shutdown.max.retries=3. If the brokers aren't able to shut down within those 90 seconds (30 seconds * 3), then when we bounce them and they start back up too quickly, we end up with a broker with all of its replica fetchers stopped (as described in this JIRA). This also seems like a specific instance of KAFKA-1120.

We have increased that setting to 40 or so, to allow brokers up to 20 minutes to shut down. Usually it takes them 8 minutes. (The relevant settings are sketched at the end of this message.)

Is it better to increase the value of controller.socket.timeout.ms instead? If we increase it to 25 minutes, for example, doesn't that affect much more than just the shutdown request? Won't normal controller->broker communication, such as LeaderAndIsr and UpdateMetadata requests, also be subject to a 25-minute timeout?

> Slow controlled shutdowns can result in stale shutdown requests
> ----------------------------------------------------------------
>
>                 Key: KAFKA-1342
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1342
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.1
>            Reporter: Joel Koshy
>            Assignee: Joel Koshy
>            Priority: Critical
>              Labels: newbie++, newbiee, reliability
>             Fix For: 0.10.3.0
>
>
> I don't think this is a bug introduced in 0.8.1, but it is triggered by the fact that controlled shutdown seems to have become slower in 0.8.1 (will file a separate ticket to investigate that). When doing a rolling bounce, it is possible for a bounced broker to stop all its replica fetchers because the previous PID's shutdown requests are still being processed.
> - 515 is the controller.
> - Controlled shutdown is initiated for 503.
> - The controller starts controlled shutdown for 503.
> - The controlled shutdown takes a long time moving leaders and moving follower replicas on 503 to the offline state.
> - So 503's read from the shutdown channel times out and a new channel is created. It issues another shutdown request. This request (since it is on a new channel) is accepted at the controller's socket server but then waits on the broker shutdown lock held by the previous controlled shutdown, which is still in progress.
> - The above step repeats for the remaining retries (six more requests).
> - 503 hits a SocketTimeout exception while reading the response to the last shutdown request and proceeds to do an unclean shutdown.
> - The controller's onBrokerFailure callback fires and moves 503's replicas to offline (not too important in this sequence).
> - 503 is brought back up.
> - The controller's onBrokerStartup callback fires and moves its replicas (and partitions) to the online state. 503 starts its replica fetchers.
> - Unfortunately, the (phantom) shutdown requests are still being handled, and the controller sends StopReplica requests to 503.
> - The first shutdown request finally finishes (after 76 minutes in my case!).
> - The remaining shutdown requests also execute and do the same thing (send StopReplica requests for all partitions to 503).
> - The remaining requests complete quickly because they end up not having to touch zookeeper paths - there are no leaders left on the broker and no need to shrink the ISR in zookeeper, since that has already been done by the first shutdown request.
> - So in the end state, 503 is up but effectively idle due to the previous PID's shutdown requests.
> There are some obvious fixes that can be made to controlled shutdown to help address the above issue. E.g., we don't really need to move follower partitions to Offline. We did that as an "optimization" so the broker falls out of the ISR sooner - which is helpful when producers set required.acks to -1. However, it adds a lot of latency to controlled shutdown. Also (more importantly), we should have a mechanism to abort any stale shutdown process.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
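
A minimal server.properties sketch of the settings discussed in the comment above, for reference. The values simply mirror the numbers in the comment and are illustrative rather than a recommendation; the defaults noted in the inline comments are the usual broker defaults as far as I know:

    # Each controlled-shutdown attempt blocks on the controller channel for up to
    # controller.socket.timeout.ms, so the total shutdown budget is roughly
    #   controller.socket.timeout.ms * controlled.shutdown.max.retries
    # plus the backoff between attempts.

    controlled.shutdown.enable=true

    # default is 3, i.e. only ~90 s of budget with the 30 s socket timeout;
    # 40 attempts gives roughly the 20 minutes mentioned above
    controlled.shutdown.max.retries=40

    # pause between attempts (default 5000 ms)
    controlled.shutdown.retry.backoff.ms=5000

    # default 30000 ms; raising this instead would also stretch the timeout on every
    # other controller->broker request (LeaderAndIsr, UpdateMetadata, StopReplica)
    controller.socket.timeout.ms=30000

Leaving controller.socket.timeout.ms at its default and raising only the retry count is the narrower change, since the socket timeout is shared by all controller->broker requests.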