[ https://issues.apache.org/jira/browse/KAFKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146250#comment-15146250 ]
Neha Narkhede edited comment on KAFKA-1464 at 2/13/16 11:36 PM:
----------------------------------------------------------------

The most useful resource to throttle is the network bandwidth used by replication, measured as the rate of total outgoing replication data on each leader. What we are looking for is the ability to cap the data transferred by each leader at an upper limit. This can be a config option similar to the one we have for the log cleaner. It seems to me that it is better to have the leader send less rather than have the replica fetch less, as the leader has a holistic view of the total amount of data being transferred out.

Data transferred from a leader includes
1. Fetch requests from an in-sync replica
2. Fetch requests from an out-of-sync replica of a partition being reassigned
3. Fetch requests from an out-of-sync replica of a partition not being reassigned

Data transferred across 1+2+3 should stay roughly within the configured upper limit. If the limit is crossed, we want to start throttling requests, all except the ones that fall under #1. The leader can divide the remaining available bandwidth among partitions that fall under #2 and #3, allowing more bandwidth to #3, since presumably it is fine to let partitions being reassigned catch up more slowly than the rest. Throttling could involve returning fewer bytes, as determined by this computation, for each such partition, as Jay suggests.

> Add a throttling option to the Kafka replication tool
> -----------------------------------------------------
>
>                 Key: KAFKA-1464
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1464
>             Project: Kafka
>          Issue Type: New Feature
>          Components: replication
>    Affects Versions: 0.8.0
>            Reporter: mjuarez
>            Assignee: Ismael Juma
>            Priority: Minor
>              Labels: replication, replication-tools
>             Fix For: 0.9.1.0
>
>
> When performing replication on new nodes of a Kafka cluster, the replication
> process will use all available resources to replicate as fast as possible.
> This causes performance issues (mostly disk IO and sometimes network
> bandwidth) when doing this in a production environment, in which you're
> trying to serve downstream applications while also performing maintenance
> on the Kafka cluster.
> An option to throttle the replication to a specific rate (in either MB/s or
> activities/second) would help production systems to better handle maintenance
> tasks while still serving downstream applications.
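
For illustration, the leader-side allocation described in the comment could look roughly like the sketch below. It is not Kafka code. Every name in it (`FetchClass`, `PartitionDemand`, `allocate`) is hypothetical, and so is the `catch_up_share` weight: the comment only says #3 should get more bandwidth than #2 and gives no ratio. The proportional split within each class is likewise an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class FetchClass(Enum):
    IN_SYNC = 1       # #1: in-sync replica, never throttled
    REASSIGNING = 2   # #2: out-of-sync replica, partition being reassigned
    CATCHING_UP = 3   # #3: out-of-sync replica, partition not being reassigned


@dataclass
class PartitionDemand:
    name: str
    fetch_class: FetchClass
    requested_bytes: int  # bytes the follower asked for in this window


def allocate(demands, limit_bytes, catch_up_share=0.75):
    """Return {partition name: bytes the leader may send in this window}.

    catch_up_share is a hypothetical weight: the fraction of the leftover
    budget offered to class #3 before class #2.
    """
    total = sum(d.requested_bytes for d in demands)
    if total <= limit_bytes:
        # Under the limit: nothing is throttled.
        return {d.name: d.requested_bytes for d in demands}

    def class_total(cls):
        return sum(d.requested_bytes for d in demands if d.fetch_class is cls)

    # Class #1 is always served in full, even if it alone exceeds the limit.
    remaining = max(0, limit_bytes - class_total(FetchClass.IN_SYNC))

    want_catch_up = class_total(FetchClass.CATCHING_UP)
    want_reassign = class_total(FetchClass.REASSIGNING)

    # Offer #3 its share first, give #2 what is left, then hand any
    # budget that #2 could not use back to #3.
    catch_up_budget = min(want_catch_up, int(remaining * catch_up_share))
    reassign_budget = min(want_reassign, remaining - catch_up_budget)
    catch_up_budget = min(want_catch_up, remaining - reassign_budget)

    budgets = {
        FetchClass.CATCHING_UP: (catch_up_budget, want_catch_up),
        FetchClass.REASSIGNING: (reassign_budget, want_reassign),
    }

    grants = {}
    for d in demands:
        if d.fetch_class is FetchClass.IN_SYNC:
            grants[d.name] = d.requested_bytes
            continue
        budget, wanted = budgets[d.fetch_class]
        # Scale each partition down in proportion to what it requested.
        grants[d.name] = 0 if wanted == 0 else d.requested_bytes * budget // wanted
    return grants
```

With a limit of 100 bytes per window and three partitions requesting 40 (in-sync), 60 (being reassigned) and 60 (not being reassigned), the in-sync partition keeps its full 40 and the remaining 60 are split 15/45 in favour of the partition that is not being reassigned.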