Re: [DISCUSS] KIP-1051 Statically configured log replication throttling

Harry Fallows Thu, 26 Dec 2024 06:53:05 -0800

Hi Kamal,

Apologies for the very long delay, but I have now updated the KIP to include 
the metrics and the kafka-reassign-partitions.sh clarification.


Thanks,
Harry

On Thursday, 18 July 2024 at 12:16, Kamal Chandraprakash 
<kamal.chandraprak...@gmail.com> wrote:

> Hi Harry,
> 
> Thanks for the updates!
> 
> Yes, the proposed metric looks good.
> 
> If the user runs the kafka-reassign-partitions script with a throttle set,
> the static throttle is overridden until the reassignment completes. Can you
> clarify this in the KIP?
> 
> --
> Kamal
> 
> 
> 
> On Sun, Jul 14, 2024 at 9:59 PM Harry Fallows
> harryfall...@protonmail.com.invalid wrote:
> 
> > Hi Kamal,
> > 
> > Thank you for reading KIP-1051!
> > 
> > Yes, it's true that it can impact regular replication traffic. However,
> > network throughput is bounded so regardless of whether we allow it as a
> > config in Kafka or not, there is always a chance that replication traffic
> > will get throttled. Having it as a config will at least ensure that the
> > entire bandwidth is not taken up by replication traffic.
> > 
> > I agree that the nature of the leader replication throttling depends on
> > how many followers there are. However, I don't think it depends on the
> > partition assignment strategy or the number of brokers; it should only
> > depend on the replication factor. I think it's key to point out here
> > that these configurations do not need to be "optimised" for use cases with
> > different replication factors, they just need to be set to match the
> > infrastructure that they are deployed in. For example if you have a maximum
> > network bandwidth of 200MB/s and a replication factor of 3, you may set
> > follower.replication.throttled.rate to 150MB/s, to reserve some
> > bandwidth for other traffic (e.g. producing and consuming). In this case,
> > if you start with all replicas in sync, I don't think it's possible for the
> > follower throttling to be the sole cause of a replica falling out of sync.
> > It may be the case that it takes longer for an out-of-sync replica to
> > become in sync, but in that case the replication throttling just serves to
> > mitigate other traffic from getting throttled (e.g. producer traffic to a
> > different partition). Even so, it is possible that misconfiguring these
> > values could cause issues, so the potential consequences should be clearly
> > documented.
> > 
> > I think the concern about producing spikes causing ISR issues is only an
> > issue if these values are poorly configured. I think in general if these
> > values are always configured as >=
> > (replicationFactor/(replicationFactor+1))*maxBandwidth (e.g. like the above
> > example: 3/(3+1) * 200 = 150), then even if 100% of the non-replication
> > traffic is producer traffic, all followers should be able to stay in sync.
> > 
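As a sanity check on the rule of thumb above, here is a minimal Python sketch of the arithmetic. The helper names `min_replication_throttle` and `to_bytes_per_second` are hypothetical and are not part of Kafka or the KIP. The conversion assumes 1 MB = 1,000,000 bytes, since Kafka's replication throttle rates are expressed in bytes per second.

```python
def min_replication_throttle(replication_factor: int, max_bandwidth_mb_s: float) -> float:
    """Lower bound from the thread: RF / (RF + 1) * max bandwidth, in MB/s."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    if max_bandwidth_mb_s <= 0:
        raise ValueError("max_bandwidth_mb_s must be positive")
    return replication_factor / (replication_factor + 1) * max_bandwidth_mb_s


def to_bytes_per_second(mb_per_s: float) -> int:
    """Convert MB/s to bytes/s (assumption: 1 MB = 1,000,000 bytes)."""
    return int(round(mb_per_s * 1_000_000))


if __name__ == "__main__":
    # The example from the thread: 200 MB/s of bandwidth, replication factor 3.
    rate_mb_s = min_replication_throttle(3, 200)
    print(rate_mb_s)                       # 150.0
    print(to_bytes_per_second(rate_mb_s))  # 150000000
```

With a replication factor of 3, the bound leaves one quarter of the bandwidth for non-replication traffic, which matches the 150MB/s figure in the example.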
> > I like the idea of emitting a metric for when a quota is breached. What do
> > you think about having it as a gauge for the number of partitions that are
> > currently leader or follower throttled (similar to the URP metric)?
> > 
> > Kind regards,
> > Harry
> > 
> > On Thursday, 11 July 2024 at 19:02, Kamal Chandraprakash <
> > kamal.chandraprak...@gmail.com> wrote:
> > 
> > > Hi Harry Fallows,
> > > 
> > > Thanks for the KIP!
> > > 
> > > I went over both KIP-1051 and KIP-1009. Assuming that
> > > leader.replication.throttled.replicas and
> > > follower.replication.throttled.replicas are set to the wildcard (*) so
> > > that they apply to all the partitions on the broker, setting a static
> > > value for the leader and follower replication throttled rate might impact
> > > the normal replication traffic.
> > > 
> > > Throttling rate depends on the number of brokers in the cluster. If the
> > > cluster contains 100+ brokers, then
> > > the leader.replication.throttled.rate is shared across all the followers.
> > > The number of followers reading
> > > data from the leader depends on the partition assignment strategy. If the
> > > leader replication throttle is breached,
> > > then the follower might fail to catch-up with the leader.
> > > 
> > > If there are sudden spikes in a specific set of topics/partitions in the
> > > cluster, then the replicas might fail to join the ISR, which can impact
> > > cluster reliability. If we are going with this proposal, then we may also
> > > have to emit a metric to inform the administrator that the leader/follower
> > > replication quota is breached.
> > > 
> > > --
> > > Kamal
> > > 
> > > On Thu, Jul 4, 2024 at 8:10 PM Harry Fallows
> > > harryfall...@protonmail.com.invalid wrote:
> > > 
> > > > Hi everyone,
> > > > 
> > > > Bumping this one last time before I call a vote. Please take a look if
> > > > you're interested in replication throttling and/or static/dynamic
> > > > config.
> > > > 
> > > > Kind regards,
> > > > Harry
> > > > 
> > > > On Thursday, 13 June 2024 at 19:39, Harry Fallows <
> > > > harryfall...@protonmail.com.INVALID> wrote:
> > > > 
> > > > > Hi Hector,
> > > > > 
> > > > > > I did see your colleague's KIP, and I actually mentioned it in the KIP
> > > > > > that I have written. As I see it, both of these KIPs move towards more
> > > > > > easily configurable replication throttling and both should be
> > > > > > implemented. KIP-1009 makes it easier to enable throttling and KIP-1051
> > > > > > makes it easier to apply a throttle rate. I did try to look at
> > > > > > supporting KIP-1009 in the discussion thread; however, I only subscribed
> > > > > > to the mailing list after it was published and I couldn't figure out how
> > > > > > to respond to it in Pony Mail. I would definitely be interested in
> > > > > > partnering up to get both changes across the line, whether that be by
> > > > > > combining them or supporting both individually (I'm not sure which is
> > > > > > best, this is my first contribution!).
> > > > > 
> > > > > > I also see that KAFKA-10190 is mentioned in KIP-1009 as a related
> > > > > > ticket. Coincidentally, I raised a PR to address this bug a couple of
> > > > > > days ago (https://github.com/apache/kafka/pull/16280). I think this is
> > > > > > also a change that will move towards more easily configurable
> > > > > > replication throttling as it allows configuring the throttle rate
> > > > > > across the whole cluster via a default value. As far as I understand,
> > > > > > this change does not need a KIP though because it is a bugfix (the
> > > > > > current behaviour of ignoring the default is unintentional).
> > > > > 
> > > > > Let me know what you think.
> > > > > 
> > > > > Kind regards,
> > > > > Harry
> > > > > 
> > > > > -------- Original Message --------
> > > > > On 6/13/24 19:08, Hector Geraldino (BLOOMBERG/ 919 3RD A)
> > > > > hgerald...@bloomberg.net wrote:
> > > > > 
> > > > > > Hi Harry,
> > > > > > 
> > > > > > > A colleague of mine opened KIP-1009: Add Broker-level Throttle
> > > > > > > Configurations, which aims to achieve the same goal (although from a
> > > > > > > different angle).
> > > > > > > 
> > > > > > > Can you please take a look and see if this would work for the things
> > > > > > > you have in mind? Maybe we can partner and coalesce around either KIP
> > > > > > > and try to push it to the end line.
> > > > > > 
> > > > > > > KIP:
> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1009%3A+Add+Broker-level+Throttle+Configurations
> > > > > > > 
> > > > > > > From: dev@kafka.apache.org At: 06/13/24 09:22:40 UTC-4:00 To:
> > > > > > > dev@kafka.apache.org
> > > > > > > Subject: Re: [DISCUSS] KIP-1051 Statically configured log replication
> > > > > > > throttling
> > > > > > 
> > > > > > Hi everyone,
> > > > > > 
> > > > > > Bumping this thread, as I haven't yet had any replies.
> > > > > > 
> > > > > > Kind regards,
> > > > > > Harry
> > > > > > 
> > > > > > On Thursday, 6 June 2024 at 17:59, Harry Fallows
> > > > > > harryfall...@protonmail.com.INVALID wrote:
> > > > > > 
> > > > > > > Hi everyone,
> > > > > > > 
> > > > > > > > I would like to propose a change to allow the static configuration of
> > > > > > > > leader and follower replication throttling rates.
> > > > > > > > 
> > > > > > > > These configurations are very useful for preventing client traffic
> > > > > > > > from getting throttled by replication traffic during events that cause
> > > > > > > > a spike in replication. Currently they are only configurable
> > > > > > > > dynamically, which means they are only really useful for throttling
> > > > > > > > replication traffic during planned events. By allowing these
> > > > > > > > configurations to be set statically, they can be used to prevent
> > > > > > > > client traffic throttling during unplanned events.
> > > > > > > 
> > > > > > > > KIP:
> > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1051%3A+Statically+configured+log+replication+throttling
> > > > > > > 
> > > > > > > Best regards,
> > > > > > > Harry Fallows
