Hi Kamal,

Apologies for the very long delay, but I have now updated the KIP to
include the metrics and the kafka-reassign-partitions.sh clarification.

Thanks,
Harry

On Thursday, 18 July 2024 at 12:16, Kamal Chandraprakash <kamal.chandraprak...@gmail.com> wrote:

> Hi Harry,
>
> Thanks for the updates!
>
> Yes, the proposed metric looks good.
>
> If the user runs the kafka-reassign-partitions script with throttle set,
> then the static throttle gets overwritten until the reassignment gets
> completed. Can you clarify this on the KIP?
>
> --
> Kamal
>
> On Sun, Jul 14, 2024 at 9:59 PM Harry Fallows
> harryfall...@protonmail.com.invalid wrote:
>
> > Hi Kamal,
> >
> > Thank you for reading KIP-1051!
> >
> > Yes, it's true that it can impact regular replication traffic. However,
> > network throughput is bounded, so regardless of whether we allow it as a
> > config in Kafka or not, there is always a chance that replication traffic
> > will get throttled. Having it as a config will at least ensure that the
> > entire bandwidth is not taken up by replication traffic.
> >
> > I agree that the nature of the leader replication throttling depends on
> > how many followers there are. However, I don't think it depends on the
> > partition assignment strategy or the number of brokers; it should only
> > depend on the replication factor. I think it's key to point out here
> > that these configurations do not need to be "optimised" for use cases
> > with different replication factors; they just need to be set to match
> > the infrastructure that they are deployed in. For example, if you have a
> > maximum network bandwidth of 200MB/s and a replication factor of 3, you
> > may set follower.replication.throttled.rate to 150MB/s, to reserve some
> > bandwidth for other traffic (e.g. producing and consuming). In this case,
> > if you start with all replicas in sync, I don't think it's possible for
> > the follower throttling to be the sole cause of a replica falling out of
> > sync.
> > It may be the case that it takes longer for an out-of-sync replica to
> > become in sync, but in that case the replication throttling just serves
> > to keep other traffic from getting throttled (e.g. producer traffic to a
> > different partition). Even so, it is possible that misconfiguring these
> > values could cause issues, so the potential consequences should be
> > clearly documented.
> >
> > I think the concern about spikes in produce traffic causing ISR issues
> > is only an issue if these values are poorly configured. In general, if
> > these values are always configured as >=
> > (replicationFactor/(replicationFactor+1))*maxBandwidth (e.g. as in the
> > example above: 3/(3+1) * 200 = 150), then even if 100% of the
> > non-replication traffic is producer traffic, all followers should be
> > able to stay in sync.
> >
> > I like the idea of emitting a metric for when a quota is breached. What
> > do you think about having it as a gauge for the number of partitions
> > that are currently leader or follower throttled (similar to the URP
> > metric)?
> >
> > Kind regards,
> > Harry
> >
> > On Thursday, 11 July 2024 at 19:02, Kamal Chandraprakash
> > <kamal.chandraprak...@gmail.com> wrote:
> >
> > > Hi Harry Fallows,
> > >
> > > Thanks for the KIP!
> > >
> > > I went over both KIP-1051 and KIP-1009. Assume that
> > > leader.replication.throttled.replicas and
> > > follower.replication.throttled.replicas are set to the wildcard (*) so
> > > that they apply to all the partitions on the broker. If we set a
> > > static value for the leader and follower replication throttled rate,
> > > then it might impact the normal replication traffic.
> > >
> > > Throttling rate depends on the number of brokers in the cluster. If
> > > the cluster contains 100+ brokers, then the
> > > leader.replication.throttled.rate is shared across all the followers.
> > > The number of followers reading data from the leader depends on the
> > > partition assignment strategy.
> > > If the leader replication throttle is breached, then the follower
> > > might fail to catch up with the leader.
> > >
> > > If there are sudden spikes in a specific set of topics/partitions in
> > > the cluster, then the replicas might fail to join the ISR, which can
> > > impact cluster reliability. If we are going with this proposal, then
> > > we may also have to emit a metric to inform the administrator that the
> > > leader/follower replication quota is breached.
> > >
> > > --
> > > Kamal
> > >
> > > On Thu, Jul 4, 2024 at 8:10 PM Harry Fallows
> > > harryfall...@protonmail.com.invalid wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > Bumping this one last time before I call a vote. Please take a look
> > > > if you're interested in replication throttling and/or static/dynamic
> > > > config.
> > > >
> > > > Kind regards,
> > > > Harry
> > > >
> > > > On Thursday, 13 June 2024 at 19:39, Harry Fallows
> > > > <harryfall...@protonmail.com.INVALID> wrote:
> > > >
> > > > > Hi Hector,
> > > > >
> > > > > I did see your colleague's KIP, and I actually mentioned it in the
> > > > > KIP that I have written. As I see it, both of these KIPs move
> > > > > towards more easily configurable replication throttling and both
> > > > > should be implemented. KIP-1009 makes it easier to enable
> > > > > throttling and KIP-1051 makes it easier to apply a throttle rate.
> > > > > I did try to look at supporting KIP-1009 in the discussion thread;
> > > > > however, I only subscribed to the mailing list after it was
> > > > > published and I couldn't figure out how to respond to it in Pony
> > > > > Mail. I would definitely be interested in partnering up to get
> > > > > both changes across the line, whether that be by combining them or
> > > > > supporting both individually (I'm not sure which is best, this is
> > > > > my first contribution!).
> > > > >
> > > > > I also see that KAFKA-10190 is mentioned in KIP-1009 as a related
> > > > > ticket. Coincidentally, I raised a PR to address this bug a couple
> > > > > of days ago (https://github.com/apache/kafka/pull/16280). I think
> > > > > this is also a change that will move towards more easily
> > > > > configurable replication throttling, as it allows configuring the
> > > > > throttle rate across the whole cluster via a default value. As far
> > > > > as I understand, this change does not need a KIP though, because
> > > > > it is a bugfix (the current behaviour of ignoring the default is
> > > > > unintentional).
> > > > >
> > > > > Let me know what you think.
> > > > >
> > > > > Kind regards,
> > > > > Harry
> > > > >
> > > > > -------- Original Message --------
> > > > > On 6/13/24 19:08, Hector Geraldino (BLOOMBERG/ 919 3RD A)
> > > > > hgerald...@bloomberg.net wrote:
> > > > >
> > > > > > Hi Harry,
> > > > > >
> > > > > > A colleague of mine opened KIP-1009: Add Broker-level Throttle
> > > > > > Configurations, which aims to achieve the same goal (although
> > > > > > from a different angle).
> > > > > >
> > > > > > Can you please take a look and see if this would work for the
> > > > > > things you have in mind? Maybe we can partner and coalesce
> > > > > > around either KIP and try to push it over the finish line.
> > > > > >
> > > > > > KIP:
> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1009%3A+Add+Broker-level+Throttle+Configurations
> > > > > >
> > > > > > From: dev@kafka.apache.org At: 06/13/24 09:22:40 UTC-4:00
> > > > > > To: dev@kafka.apache.org
> > > > > > Subject: Re: [DISCUSS] KIP-1051 Statically configured log
> > > > > > replication throttling
> > > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > Bumping this thread, as I haven't yet had any replies.
> > > > > >
> > > > > > Kind regards,
> > > > > > Harry
> > > > > >
> > > > > > On Thursday, 6 June 2024 at 17:59, Harry Fallows
> > > > > > harryfall...@protonmail.com.INVALID wrote:
> > > > > >
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > I would like to propose a change to allow the static
> > > > > > > configuration of leader and follower replication throttling
> > > > > > > rates.
> > > > > > >
> > > > > > > These configurations are very useful for preventing client
> > > > > > > traffic from getting throttled by replication traffic during
> > > > > > > events that cause a spike in replication. Currently they are
> > > > > > > only configurable dynamically, which means they are only
> > > > > > > really useful for throttling replication traffic during
> > > > > > > planned events. By allowing these configurations to be set
> > > > > > > statically, they can be used to prevent client traffic
> > > > > > > throttling during unplanned events.
> > > > > > >
> > > > > > > KIP:
> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1051%3A+Statically+configured+log+replication+throttling
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Harry Fallows
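
Note on the sizing rule discussed in the thread: Harry's reply of 14 July suggests configuring the replication throttle as at least replicationFactor / (replicationFactor + 1) * maxBandwidth. The arithmetic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name min_replication_throttle and its parameters are hypothetical and are not part of Kafka or of either KIP, and the result is in whatever unit the bandwidth figure uses (MB/s in the thread's example).

```python
def min_replication_throttle(replication_factor: int, max_bandwidth: float) -> float:
    """Lower bound for the throttle rate, per the rule quoted in the thread.

    Hypothetical helper (not a Kafka API): computes
    replication_factor / (replication_factor + 1) * max_bandwidth,
    in the same unit as max_bandwidth.
    """
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    if max_bandwidth <= 0:
        raise ValueError("max_bandwidth must be positive")
    return replication_factor / (replication_factor + 1) * max_bandwidth


# The thread's example: replication factor 3 on a 200 MB/s link.
example_rate = min_replication_throttle(3, 200.0)
print(example_rate)  # 150.0
```

With a replication factor of 3 and 200 MB/s of bandwidth this reproduces the 150 MB/s figure from the thread, leaving 50 MB/s for other traffic such as producing and consuming.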