Hi Kamal,

Thank you for reading KIP-1051!

Yes, it's true that a static throttle rate can impact regular replication 
traffic. However, network throughput is bounded, so regardless of whether we 
allow it as a config in Kafka or not, there is always a chance that replication 
traffic will get throttled. Having it as a config will at least ensure that the 
entire bandwidth is not taken up by replication traffic.

I agree that the nature of the leader replication throttling depends on how 
many followers there are. However, I don't think it depends on the 
partition assignment strategy or the number of brokers; it should only 
depend on the replication factor. I think it's key to point out here that 
these configurations do not need to be "optimised" for use cases with different 
replication factors; they just need to be set to match the infrastructure that 
they are deployed in. For example, if you have a maximum network bandwidth of 
200MB/s and a replication factor of 3, you may set 
follower.replication.throttled.rate to 150MB/s, to reserve some bandwidth 
for other traffic (e.g. producing and consuming). In this case, if you start 
with all replicas in sync, I don't think it's possible for the follower 
throttling to be the sole cause of a replica falling out of sync. It may take 
longer for an out-of-sync replica to get back in sync, but in that case the 
replication throttling just serves to prevent other traffic from getting 
throttled (e.g. producer traffic to a different partition). Even so, 
misconfiguring these values could cause issues, so the potential consequences 
should be clearly documented.

I think the concern about produce spikes causing ISR issues only applies if 
these values are poorly configured. In general, if they are always configured 
as >= (replicationFactor/(replicationFactor+1))*maxBandwidth (e.g. like the 
above example: 3/(3+1) * 200 = 150), then even if 100% of the non-replication 
traffic is producer traffic, all followers should be able to stay in sync.
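
As a rough sketch of that rule of thumb, the lower bound can be computed as 
below. The function names are illustrative only and are not part of the KIP or 
of Kafka. The conversion assumes 1 MB = 1,000,000 bytes, since the throttle 
rate configs are expressed in bytes per second.

```python
def min_throttle_rate_mb_s(replication_factor: int, max_bandwidth_mb_s: float) -> float:
    """Lower bound from the rule of thumb: RF / (RF + 1) * maxBandwidth."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be >= 1")
    if max_bandwidth_mb_s <= 0:
        raise ValueError("max_bandwidth_mb_s must be positive")
    return replication_factor / (replication_factor + 1) * max_bandwidth_mb_s


def mb_s_to_bytes_s(rate_mb_s: float) -> int:
    """Convert MB/s to bytes/s, assuming 1 MB = 1,000,000 bytes."""
    return int(rate_mb_s * 1_000_000)


# The example from above: 200MB/s of bandwidth, replication factor 3.
rate = min_throttle_rate_mb_s(3, 200)
print(rate)                   # 150.0
print(mb_s_to_bytes_s(rate))  # 150000000
```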

I like the idea of emitting a metric for when a quota is breached. What do you 
think about having it as a gauge for the number of partitions that are 
currently leader or follower throttled (similar to the URP metric)?
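
To make the gauge idea concrete, here is a minimal sketch of what it would 
count. Everything in it is hypothetical: the PartitionThrottleState type and 
its fields are invented for illustration and do not correspond to any existing 
Kafka class or metric.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PartitionThrottleState:
    # Hypothetical per-partition view; not an existing Kafka type.
    topic: str
    partition: int
    leader_throttled: bool
    follower_throttled: bool


def throttled_partition_count(states: Iterable[PartitionThrottleState]) -> int:
    """Gauge value: partitions currently throttled as leader or as follower."""
    return sum(1 for s in states if s.leader_throttled or s.follower_throttled)


states = [
    PartitionThrottleState("orders", 0, leader_throttled=True, follower_throttled=False),
    PartitionThrottleState("orders", 1, leader_throttled=False, follower_throttled=False),
    PartitionThrottleState("payments", 0, leader_throttled=False, follower_throttled=True),
]
print(throttled_partition_count(states))  # 2
```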

Kind regards,
Harry

On Thursday, 11 July 2024 at 19:02, Kamal Chandraprakash 
<kamal.chandraprak...@gmail.com> wrote:

> Hi Harry Fallows,
>
> Thanks for the KIP!
>
> I went over both the KIP-1051 and KIP-1009. Assuming that the
> leader.replication.throttled.replicas
> and follower.replication.throttled.replicas are set to Wildcard (*) to
> apply for all the partitions in the
> broker. If we set a static value for leader and follower replication
> throttled rate, then it might impact
> the normal replication traffic.
>
> Throttling rate depends on the number of brokers in the cluster. If the
> cluster contains 100+ brokers, then
> the leader.replication.throttled.rate is shared across all the followers.
> The number of followers reading
> data from the leader depends on the partition assignment strategy. If the
> leader replication throttle is breached,
> then the follower might fail to catch-up with the leader.
>
> If there are sudden spikes in a specific set of topics/partitions in the
> cluster, then the replicas might fail to join
> the isr and can impact the cluster reliability. If we are going with this
> proposal, then we may also have to emit
> a metric to inform the administrator that the leader/follower replication
> quota is breached.
>
> --
> Kamal
>
> On Thu, Jul 4, 2024 at 8:10 PM Harry Fallows
> harryfall...@protonmail.com.invalid wrote:
>
> > Hi everyone,
> >
> > Bumping this one last time before I call a vote. Please take a look if
> > you're interested in replication throttling and/or static/dynamic config.
> >
> > Kind regards,
> > Harry
> >
> > On Thursday, 13 June 2024 at 19:39, Harry Fallows <
> > harryfall...@protonmail.com.INVALID> wrote:
> >
> > > Hi Hector,
> > >
> > > I did see your colleague's KIP, and I actually mentioned it in the KIP
> > > that I have written. As I see it, both of these KIPs move towards more
> > > easily configurable replication throttling and both should be implemented.
> > > KIP-1009 makes it easier to enable throttling and KIP-1051 makes it easier
> > > to apply a throttle rate. I did try to look at supporting KIP-1009 in the
> > > discussion thread, however, I only subscribed to the mailing list after it
> > > was published and I couldn't figure out how to respond to it in Pony mail.
> > > > I would definitely be interested in partnering up to get both changes
> > > across the line, whether that be by combining them or supporting both
> > > individually (I'm not sure which is best, this is my first contribution!).
> > >
> > > I also see that KAFKA-10190 is mentioned in KIP-1009 as a related
> > > ticket. Coincidentally, I raised a PR to address this bug a couple of days
> > > ago (https://github.com/apache/kafka/pull/16280). I think this is also a
> > > change that will move towards more easily configurable replication
> > > throttling as it allows configuring the throttle rate across the whole
> > > cluster via a default value. As far as I understand, this change does not
> > > > need a KIP though because it is a bugfix (the current behaviour of
> > > > ignoring the default is unintentional).
> > >
> > > Let me know what you think.
> > >
> > > Kind regards,
> > > Harry
> > >
> > > -------- Original Message --------
> > > On 6/13/24 19:08, Hector Geraldino (BLOOMBERG/ 919 3RD A)
> > > hgerald...@bloomberg.net wrote:
> > >
> > > > Hi Harry,
> > > >
> > > > A colleague of mine opened KIP-1009: Add Broker-level Throttle
> > > > Configurations, which aims to achieve the same goal (although from a
> > > > different angle).
> > > >
> > > > Can you please take a look and see if this would work for the things
> > > > > you have in mind? Maybe we can partner and coalesce around either KIP
> > > > > and try to push it to the end line.
> > > >
> > > > KIP:
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1009%3A+Add+Broker-level+Throttle+Configurations
> > > >
> > > > > From: dev@kafka.apache.org At: 06/13/24 09:22:40 UTC-4:00 To:
> > > > > dev@kafka.apache.org
> > > > Subject: Re: [DISCUSS] KIP-1051 Statically configured log replication
> > > > throttling
> > > >
> > > > Hi everyone,
> > > >
> > > > Bumping this thread, as I haven't yet had any replies.
> > > >
> > > > Kind regards,
> > > > Harry
> > > >
> > > > On Thursday, 6 June 2024 at 17:59, Harry Fallows
> > > > harryfall...@protonmail.com.INVALID wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > > I would like to propose a change to allow the static configuration
> > > > > > of leader and follower replication throttling rates.
> > > > > >
> > > > > > These configurations are very useful for preventing client traffic
> > > > > > from getting throttled by replication traffic during events that
> > > > > > cause a spike in replication. Currently they are only configurable
> > > > > > dynamically, which means they are only really useful for throttling
> > > > > > replication traffic during planned events. By allowing these
> > > > > > configurations to be set statically, they can be used to prevent
> > > > > > client traffic throttling during unplanned events.
> > > > >
> > > > > > KIP:
> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1051%3A+Statically+configured+log+replication+throttling
> > > > >
> > > > > Best regards,
> > > > > Harry Fallows
