[ https://issues.apache.org/jira/browse/KAFKA-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884094#comment-15884094 ]
James Cheng commented on KAFKA-1342:
------------------------------------

[~jjkoshy], [~toddpalino], is it still true that it is unsafe to increase the number of controlled shutdown retries (controlled.shutdown.max.retries)?

We currently have brokers with 10,000 partitions each, and there is no way they can shut down within the 30-second shutdown timeout, even with the current default of controlled.shutdown.max.retries=3. If the brokers aren't able to shut down within those 90 seconds (30 seconds * 3), then when we bounce them and they start back up too quickly, we end up with a broker with all of its replica fetchers stopped (as described in this JIRA). This also seems like a specific instance of KAFKA-1120.

We have increased that setting to 40 or so, to allow brokers up to 20 minutes to shut down. Usually it takes them 8 minutes. (The relevant settings are sketched at the end of this message.)

Is it better to increase the value of controller.socket.timeout.ms instead? If we increase it to 25 minutes, for example, doesn't that affect much more than just the shutdown request? Won't normal controller->broker communication, such as LeaderAndIsr and UpdateMetadata requests, also be subject to a 25-minute timeout?

> Slow controlled shutdowns can result in stale shutdown requests
> ----------------------------------------------------------------
>
>                 Key: KAFKA-1342
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1342
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.1
>            Reporter: Joel Koshy
>            Assignee: Joel Koshy
>            Priority: Critical
>              Labels: newbie++, newbiee, reliability
>             Fix For: 0.10.3.0
>
>
> I don't think this is a bug introduced in 0.8.1, but it is triggered by the fact that controlled shutdown seems to have become slower in 0.8.1 (will file a separate ticket to investigate that). When doing a rolling bounce, it is possible for a bounced broker to stop all its replica fetchers because the previous PID's shutdown requests are still being processed.
> - 515 is the controller.
> - Controlled shutdown is initiated for 503.
> - The controller starts controlled shutdown for 503.
> - The controlled shutdown takes a long time moving leaders and moving follower replicas on 503 to the offline state.
> - So 503's read from the shutdown channel times out and a new channel is created. It issues another shutdown request. This request (since it is on a new channel) is accepted at the controller's socket server but then waits on the broker shutdown lock held by the previous controlled shutdown, which is still in progress.
> - The above step repeats for the remaining retries (six more requests).
> - 503 hits a SocketTimeout exception while reading the response to the last shutdown request and proceeds to do an unclean shutdown.
> - The controller's onBrokerFailure callback fires and moves 503's replicas to offline (not too important in this sequence).
> - 503 is brought back up.
> - The controller's onBrokerStartup callback fires and moves its replicas (and partitions) to the online state. 503 starts its replica fetchers.
> - Unfortunately, the (phantom) shutdown requests are still being handled, and the controller sends StopReplica requests to 503.
> - The first shutdown request finally finishes (after 76 minutes in my case!).
> - The remaining shutdown requests also execute and do the same thing (send StopReplica requests for all partitions to 503).
> - The remaining requests complete quickly because they end up not having to touch zookeeper paths - there are no leaders left on the broker and no need to shrink the ISR in zookeeper, since that has already been done by the first shutdown request.
> - So in the end state, 503 is up but effectively idle due to the previous PID's shutdown requests.
> There are some obvious fixes that can be made to controlled shutdown to help address the above issue. E.g., we don't really need to move follower partitions to Offline. We did that as an "optimization" so the broker falls out of the ISR sooner - which is helpful when producers set required.acks to -1. However, it adds a lot of latency to controlled shutdown. Also (more importantly), we should have a mechanism to abort any stale shutdown process.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
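
A minimal server.properties sketch of the settings discussed in the comment above, for reference. The values simply mirror the numbers in the comment and are illustrative rather than a recommendation; the defaults noted in the inline comments are the usual broker defaults as far as I know:

    # Each controlled-shutdown attempt blocks on the controller channel for up to
    # controller.socket.timeout.ms, so the total shutdown budget is roughly
    #   controller.socket.timeout.ms * controlled.shutdown.max.retries
    # plus the backoff between attempts.

    controlled.shutdown.enable=true

    # default is 3, i.e. only ~90 s of budget with the 30 s socket timeout;
    # 40 attempts gives roughly the 20 minutes mentioned above
    controlled.shutdown.max.retries=40

    # pause between attempts (default 5000 ms)
    controlled.shutdown.retry.backoff.ms=5000

    # default 30000 ms; raising this instead would also stretch the timeout on every
    # other controller->broker request (LeaderAndIsr, UpdateMetadata, StopReplica)
    controller.socket.timeout.ms=30000

Leaving controller.socket.timeout.ms at its default and raising only the retry count is the narrower change, since the socket timeout is shared by all controller->broker requests.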