[ 
https://issues.apache.org/jira/browse/KAFKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327561#comment-15327561
 ] 

Ralph Weires commented on KAFKA-1464:
-------------------------------------

We have problems similar to those Jason described above, in our case usually 
when taking a broker offline due to hardware failure (a broken HD; each of our 
brokers has 2 HDs / log directories). When the broker comes back online with 
one fresh disk and the corresponding data missing (i.e. half of that broker's 
partitions), its network link is saturated for some time by inbound traffic 
while it catches up on replication.

While the broker is re-streaming all the missing data, we also experience 
problems with consumers. Once the broker has caught up on its missing data, 
the situation quickly returns to normal.

To me it seems that the partitions on which the broker catches up soon after 
restart (esp. those on the non-broken HD, which had only a little data 
missing) cause issues if the broker becomes leader for them while its incoming 
link is still clogged with replication of the remaining data.

In this scenario, I would prefer to let the broker catch up on whatever 
replication it still needs to do without becoming leader for any of its 
partitions. Is there a way to achieve this? That is, keep the broker online 
and replicating, but not have it take over any partition leadership (at least 
as long as other leadership candidates are available). Being able to toggle 
that behavior at run-time would be ideal: we would explicitly re-enable 
leadership after the maintenance interval, once the node has caught up on the 
bulk of the necessary replication. IMO this could be an alternative to a 
throttling approach.

> Add a throttling option to the Kafka replication tool
> -----------------------------------------------------
>
>                 Key: KAFKA-1464
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1464
>             Project: Kafka
>          Issue Type: New Feature
>          Components: replication
>    Affects Versions: 0.8.0
>            Reporter: mjuarez
>            Assignee: Ismael Juma
>            Priority: Minor
>              Labels: replication, replication-tools
>             Fix For: 0.10.1.0
>
>
> When performing replication on new nodes of a Kafka cluster, the replication 
> process will use all available resources to replicate as fast as possible.  
> This causes performance issues (mostly disk IO and sometimes network 
> bandwidth) when doing this in a production environment, in which you're 
> trying to serve downstream applications, at the same time you're performing 
> maintenance on the Kafka cluster.
> An option to throttle the replication to a specific rate (in either MB/s or 
> activities/second) would help production systems to better handle maintenance 
> tasks while still serving downstream applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
