[ https://issues.apache.org/jira/browse/KAFKA-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884390#comment-15884390 ]
Jiangjie Qin commented on KAFKA-3436:
-------------------------------------

[~onurkaraman] is currently working on rewriting the controller. The latest trunk already has some controlled shutdown performance improvements from batching the partitions. Have you had a chance to try it?

> Speed up controlled shutdown.
> -----------------------------
>
>                 Key: KAFKA-3436
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3436
>             Project: Kafka
>          Issue Type: Improvement
>    Affects Versions: 0.9.0.0
>            Reporter: Jiangjie Qin
>            Assignee: Jiangjie Qin
>             Fix For: 0.10.3.0
>
>
> Currently, a rolling bounce of a Kafka cluster with tens of thousands of
> partitions can take a very long time (~2 min for each broker with ~5000
> partitions/broker in our environment). The majority of the time is spent on
> shutting down a broker. The time to shut down a broker usually includes the
> following parts:
>
> T1: During controlled shutdown, people usually want to make sure there are
> no under-replicated partitions. So shutting down a broker during a rolling
> bounce has to wait for the previously restarted broker to catch up. This
> is T1.
>
> T2: The time to send the controlled shutdown request and receive the
> controlled shutdown response. Currently a controlled shutdown request
> triggers many LeaderAndIsrRequests and UpdateMetadataRequests, and also
> involves many ZooKeeper updates performed serially.
>
> T3: The actual time to shut down all the components. It is usually small
> compared with T1 and T2.
>
> T1 is related to:
> A) the inbound throughput on the cluster, and
> B) the "down" time of the broker (time between replica fetchers stopping
> and replica fetchers restarting)
> The larger the traffic is, or the longer the broker has stopped fetching,
> the longer it will take for the broker to catch up and get back into the
> ISR, and therefore the longer T1 will be.
>
> Assume:
> * the inbound network traffic is X bytes/second on a broker
> * the time T1.B ("down" time) mentioned above is T
> Theoretically it will take
>   (X * T) / (NetworkBandwidth - X)
>   = InboundNetworkUtilization * T / (1 - InboundNetworkUtilization)
> for the broker to catch up after the restart, where
> InboundNetworkUtilization = X / NetworkBandwidth. While X is out of our
> control, T is largely related to T2.
>
> The purpose of this ticket is to reduce T2 by:
> 1. Batching the LeaderAndIsrRequest and UpdateMetadataRequest during
> controlled shutdown.
> 2. Using async ZooKeeper writes to pipeline the ZooKeeper writes. According
> to the ZooKeeper wiki
> (https://wiki.apache.org/hadoop/ZooKeeper/Performance), a 3-node ZK cluster
> should be able to handle 20K writes (1K size). So if we use async writes, we
> will likely be able to reduce the ZooKeeper update time to a few seconds or
> even sub-second level.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
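For reference, the catch-up estimate from the issue description, (X * T) / (NetworkBandwidth - X), can be tried out with a small sketch like the one below. The function name, parameter names, and the example numbers (a 1 Gbps link, 50 MB/s inbound, 60 s of down time) are illustrative assumptions and are not taken from the ticket or from any Kafka code.

```python
def catch_up_seconds(inbound_bytes_per_sec: float,
                     bandwidth_bytes_per_sec: float,
                     down_time_sec: float) -> float:
    """Estimate catch-up time as (X * T) / (NetworkBandwidth - X).

    X is the inbound traffic on the broker, T is the time the replica
    fetchers were stopped. All names here are hypothetical.
    """
    if inbound_bytes_per_sec < 0 or down_time_sec < 0:
        raise ValueError("traffic and down time must be non-negative")
    if inbound_bytes_per_sec >= bandwidth_bytes_per_sec:
        # With no spare bandwidth the backlog never shrinks.
        raise ValueError("inbound traffic must be below the network bandwidth")
    backlog_bytes = inbound_bytes_per_sec * down_time_sec            # X * T
    spare_bytes_per_sec = bandwidth_bytes_per_sec - inbound_bytes_per_sec
    return backlog_bytes / spare_bytes_per_sec


def catch_up_seconds_from_utilization(utilization: float,
                                      down_time_sec: float) -> float:
    """Same estimate in the form U * T / (1 - U), with U = X / NetworkBandwidth."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return utilization * down_time_sec / (1 - utilization)


if __name__ == "__main__":
    # Assumed example: 1 Gbps link (125 MB/s), 50 MB/s inbound, 60 s down time.
    print(catch_up_seconds(50e6, 125e6, 60.0))
    print(catch_up_seconds_from_utilization(0.4, 60.0))
```

Both forms agree for the same inputs, and the estimate grows quickly as utilization approaches 1, which is why shortening T (and therefore T2) matters most on busy clusters.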