Yes, I think it's a great discussion to have. There are definitely pros and cons to both approaches, and it's worth thinking about the right way forward.
On Thu, Aug 18, 2016 at 11:03 AM, Todd Palino <tpal...@gmail.com> wrote:

This all makes a lot of sense, and mirrors what I'm thinking as I finally took some time to really walk through scenarios around why we move partitions around.

What I'm wondering is whether it makes sense to have a conversation around breaking out the controller entirely, separating it from the brokers, and starting to add this intelligence into that. I don't think anyone will disagree that the controller needs a sizable amount of work. This definitely wouldn't be the first project to separate out the brains from the dumb worker processes.

-Todd

On Thu, Aug 18, 2016 at 10:53 AM, Gwen Shapira <g...@confluent.io> wrote:

Just my take, since Jun and Ben originally wanted to solve a more general approach and I talked them out of it :)

When we first add the feature, safety is probably most important in getting people to adopt it - I wanted to make the feature very safe by never throttling something admins don't want to throttle. So we figured the manual approach, while more challenging to configure, is the safest. Admins usually know which replicas are "at risk" of taking over and can choose to throttle them accordingly, they can build their own integration with monitoring tools, etc.

It feels like any "smarts" we try to build into Kafka can be done better by external tools that can watch both Kafka traffic (with the new metrics) and things like network and CPU monitors.

We are open to a smarter approach in Kafka, but perhaps plan it for a follow-up KIP? Maybe even after we have some experience with the manual approach and how best to make throttling decisions. Similar to what we do with choosing partitions to move around - we started manually, admins are getting experience at how they like to choose replicas, and then we can bake their expertise into the product.
Gwen

On Thu, Aug 18, 2016 at 10:29 AM, Jun Rao <j...@confluent.io> wrote:

Joel,

Yes, for your second comment. The tricky thing is still to figure out which replicas to throttle and by how much, since in general, admins probably don't want already in-sync or close to in-sync replicas to be throttled. It would be great to get Todd's opinion on this. Could you ping him?

Yes, we'd be happy to discuss auto-detection of effect traffic more offline.

Thanks,

Jun

On Thu, Aug 18, 2016 at 10:21 AM, Joel Koshy <jjkosh...@gmail.com> wrote:

> For your first comment. We thought about determining "effect" replicas
> automatically as well. First, there are some tricky stuff that one has to

Auto-detection of effect traffic: I'm fairly certain it's doable but definitely tricky. I'm also not sure it is something worth tackling at the outset. If we want to spend more time thinking it over, even if it's just an academic exercise, I would be happy to brainstorm offline.

> For your second comment, we discussed that in the client quotas design. A
> downside of that for client quotas is that a client may be surprised that
> its traffic is not throttled at one time, but throttled at another with the
> same quota (basically, less predictability). You can imagine setting a
> quota for all replication traffic and only slowing down the "effect"
> replicas if needed. The thought is more or less the same as the above.
> It requires more

For clients, this is true. I think this is much less of an issue for server-side replication since the "users" here are the Kafka SREs, who generally know these internal details.

I think it would be valuable to get some feedback from SREs on the proposal before proceeding to a vote.
(ping Todd)

Joel

On Thu, Aug 18, 2016 at 9:37 AM, Ben Stopford <b...@confluent.io> wrote:

Hi Joel

Ha! Yes, we had some similar thoughts, on both counts. Both are actually good approaches, but come with some extra complexity.

Segregating the replication type is tempting as it creates a more general solution. One issue is that you need to draw a line between lagging and not lagging. The ISR 'limit' is a tempting divider, but has the side effect that, once you drop out, you get immediately throttled. Adding a configurable divider is another option, but difficult for admins to set, and always a little arbitrary. A better idea is to prioritise in reverse order of lag. But that also comes with additional complexity of its own.

Under-throttling is also a tempting addition. That's to say, if there's idle bandwidth lying around, not being used, why not use it to let lagging brokers catch up? This involves some comparison to the maximum bandwidth, which could be configurable or could be derived, with pros and cons for each.

But the more general problem is actually quite hard to reason about, so after some discussion we decided to settle on something simple that we felt we could get working, and extend to add these additional features as subsequent KIPs.

I hope that seems reasonable. Jun may wish to add to this.
B

On 18 Aug 2016, at 06:56, Joel Koshy <jjkosh...@gmail.com> wrote:

On Wed, Aug 17, 2016 at 9:13 PM, Ben Stopford <b...@confluent.io> wrote:

> Let us know if you have any further thoughts on KIP-73, else we'll kick
> off a vote.

I think the mechanism for throttling replicas looks good. I just had a few more thoughts on the configuration section. What you have looks reasonable, but I was wondering if it could be made simpler. You probably thought through these, so I'm curious to know your take.

My guess is that most of the time, users would want to throttle all effect replication - due to partition reassignments, adding brokers, or a broker coming back online after an extended period of time. In all these scenarios it may be possible to distinguish bootstrap (effect) vs. normal replication - based on how far the replica has to catch up. I'm wondering if it is enough to just set an umbrella "effect" replication quota, with perhaps per-topic overrides (say, if some topics are more important than others), as opposed to designating throttled replicas.

Also, IIRC during the client-side quota discussions we had considered the possibility of allowing clients to go above their quotas when resources are available. We ended up not doing that, but for replication throttling it may make sense - i.e., to treat the quota as a soft limit.
Another way to look at it: instead of ensuring "effect replication traffic does not flow faster than X bytes/sec", it may be useful to instead ensure that "effect replication traffic only flows as slowly as necessary (so as not to adversely affect normal replication traffic)."

Thanks,

Joel

On Thu, Aug 11, 2016 at 2:43 PM, Jun Rao <j...@confluent.io> wrote:

Hi, Joel,

Yes, the response size includes both throttled and unthrottled replicas. However, the response is only delayed up to max.wait if the response size is less than min.bytes, which matches the current behavior. So, there is no extra delay due to throttling, right? For replica fetchers, the default min.bytes is 1. So, the response is only delayed if there is no byte in the response, which is what we want.

Thanks,

Jun

On Thu, Aug 11, 2016 at 11:53 AM, Joel Koshy <jjkosh...@gmail.com> wrote:

Hi Jun,

I'm not sure that would work unless we have separate replica fetchers, since this would cause all replicas (including ones that are not throttled) to get delayed.
Instead, we could just have the leader populate the throttle-time field of the response as a hint to the follower as to how long it should wait before it adds those replicas back to its subsequent replica fetch requests.

Thanks,

Joel

On Thu, Aug 11, 2016 at 9:50 AM, Jun Rao <j...@confluent.io> wrote:

Mayuresh,

That's a good question. I think if the response size (after leader throttling) is smaller than min.bytes, we will just delay the sending of the response up to max.wait as we do now. This should prevent frequent empty responses to the follower.

Thanks,

Jun

On Wed, Aug 10, 2016 at 9:17 PM, Mayuresh Gharat <gharatmayures...@gmail.com> wrote:

This might have been answered before. I was wondering about the case when the leader quota is reached and it sends an empty response ("If the inclusion of a partition, listed in the leader's throttled-replicas list, causes the LeaderQuotaRate to be exceeded, that partition is omitted from the response (aka returns 0 bytes)."). At this point the follower quota is NOT reached, and the follower is still going to ask for that partition in the next fetch request.
Would it be fair to add some logic there so that the follower backs off (for some configurable time) from including those partitions in the next fetch request?

Thanks,

Mayuresh

On Wed, Aug 10, 2016 at 8:06 AM, Ben Stopford <b...@confluent.io> wrote:

Thanks again for the responses everyone. I've removed the extra fetcher threads from the proposal, switching to the inclusion-based approach. The relevant section is:

The follower makes a request, using the fixed size of replica.fetch.response.max.bytes as per KIP-74 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-74%3A+Add+Fetch+Response+Size+Limit+in+Bytes>. The order of the partitions in the fetch request is randomised to ensure fairness. When the leader receives the fetch request, it processes the partitions in the defined order, up to the response's size limit. If the inclusion of a partition, listed in the leader's throttled-replicas list, causes the LeaderQuotaRate to be exceeded, that partition is omitted from the response (aka returns 0 bytes).
Logically, this is of the form:

    var bytesAllowedForThrottledPartition = quota.recordAndMaybeAdjust(bytesRequestedForPartition)

When the follower receives the fetch response, if it includes partitions in its throttled-partitions list, it increments the FollowerQuotaRate:

    var includeThrottledPartitionsInNextRequest: Boolean = quota.recordAndEvaluate(previousResponseThrottledBytes)

If the quota is exceeded, no throttled partitions will be included in the next fetch request emitted by this replica fetcher thread.

B

On 9 Aug 2016, at 23:34, Jun Rao <j...@confluent.io> wrote:

When there are several unthrottled replicas, we could also just do what's suggested in KIP-74. The client is responsible for reordering the partitions, and the leader fills in the bytes to those partitions in order, up to the quota limit.

We could also do what you suggested. If the quota is exceeded, include empty data in the response for throttled replicas. Keep doing that until enough time has passed so that the quota is no longer exceeded. This potentially allows better batching per partition.
Not sure if the two make a big difference in practice though.

Thanks,

Jun

On Tue, Aug 9, 2016 at 2:31 PM, Joel Koshy <jjkosh...@gmail.com> wrote:

> On the leader side, one challenge is related to the fairness issue that
> Ben brought up. The question is what if the fetch response limit is filled
> up by the throttled replicas? If this happens constantly, we will delay
> the progress of those un-throttled replicas. However, I think we can
> address this issue by trying to fill up the unthrottled replicas in the
> response first. So, the algorithm would be: fill up unthrottled replicas
> up to the fetch response limit. If there is space left, fill up throttled
> replicas. If the quota is exceeded for the throttled replicas, reduce the
> bytes in the throttled replicas in the response accordingly.

Right - that's what I was trying to convey by truncation (vs. empty).
So we would attempt to fill the response for throttled partitions as much as we can before hitting the quota limit. There is one more detail to handle in this: if there are several throttled partitions and not enough remaining allowance in the fetch response to include all the throttled replicas, then we would need to decide which of those partitions get a share; which is why I'm wondering if it is easier to return empty for those partitions entirely in the fetch response - they will make progress in the subsequent fetch. If they don't make fast enough progress, then that would be a case for raising the threshold or letting it complete at an off-peak time.

> With this approach, we need some new logic to handle throttling on the
> leader, but we can leave the replica threading model unchanged. So,
> overall, this still seems to be a simpler approach.
> Thanks,
>
> Jun

On Tue, Aug 9, 2016 at 11:57 AM, Mayuresh Gharat <gharatmayures...@gmail.com> wrote:

Nice write-up Ben.

I agree with Joel on keeping this simple by excluding the partitions from the fetch request/response when the quota is violated at the follower or leader, instead of having a separate set of threads for handling the quota and non-quota cases. Even though it's different from the current quota implementation, it should be OK since it's internal to brokers and can be handled by tuning the quota configs for it appropriately by the admins.
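The follower-side half of this include/exclude mechanism - record the throttled bytes just received, and drop throttled partitions from the next fetch while the observed rate exceeds the quota - could be sketched roughly as below. This is an illustrative Python sketch only, not Kafka's implementation (the broker is Scala and uses sampled rate metrics); the class and method names are invented here, loosely echoing the `recordAndEvaluate` pseudocode quoted elsewhere in this thread.

```python
import time
from collections import deque


class ReplicationQuota:
    """Illustrative sliding-window byte-rate quota (invented for this
    sketch; not Kafka's actual quota manager)."""

    def __init__(self, bytes_per_sec, window_sec=1.0):
        self.bytes_per_sec = bytes_per_sec
        self.window_sec = window_sec
        self.samples = deque()  # (timestamp, bytes) pairs inside the window

    def record_and_evaluate(self, num_bytes, now=None):
        """Record bytes just received for throttled replicas; return True
        if throttled partitions may be included in the next fetch request."""
        now = time.monotonic() if now is None else now
        self.samples.append((now, num_bytes))
        # Expire samples that have fallen out of the window.
        while self.samples and self.samples[0][0] <= now - self.window_sec:
            self.samples.popleft()
        rate = sum(b for _, b in self.samples) / self.window_sec
        return rate <= self.bytes_per_sec
```

A fetcher loop would call `record_and_evaluate` with the throttled bytes from each response and simply omit throttled partitions from the next request whenever it returns False.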
Also, can you elaborate with an example on how this would be handled: *guaranteeing ordering of updates when replicas shift threads*

Thanks,

Mayuresh

On Tue, Aug 9, 2016 at 10:49 AM, Joel Koshy <jjkosh...@gmail.com> wrote:

On the need for both leader/follower throttling: that makes sense - thanks for clarifying. For completeness, can we add this detail to the doc - say, after the quote that I pasted earlier?

From an implementation perspective though: I'm still interested in the simplicity of not having to add separate replica fetchers, a delay queue on the leader, and "moving" partitions from the throttled replica fetchers to the regular replica fetchers once caught up.

Instead, I think it would work and be simpler to include or exclude the partitions in the fetch request from the follower and fetch response from the leader when the quota is violated.
The issue of fairness that Ben noted may be a wash between the two options (that Ben wrote in his email). With the default quota delay mechanism, partitions get delayed essentially at random - i.e., whoever fetches at the time of quota violation gets delayed at the leader. So we can adopt a similar policy in choosing to truncate partitions in fetch responses. I.e., if at the time of handling the fetch the "effect" replication rate exceeds the quota, then either empty or truncate those partitions from the response. (BTW, effect replication is your terminology in the wiki - i.e., replication due to partition reassignment, adding brokers, etc.)

While this may be slightly different from the existing quota mechanism, I think the difference is small (since we would reuse the quota manager, at worst with some refactoring) and will be internal to the broker.
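The leader-side fill policy being discussed - fill unthrottled partitions first up to the response size limit, then give throttled partitions whatever room and quota allowance remain, returning them empty (0 bytes) otherwise - might look roughly like this. This is an illustrative Python sketch with invented names; the real broker implements this in Scala against actual log segments.

```python
def fill_fetch_response(partitions, available_bytes, response_limit,
                        throttled, quota_allowance):
    """Sketch of the fill policy from the thread. `partitions` preserves
    the (randomised) order from the fetch request; `available_bytes` maps
    each partition to the bytes it has ready; `throttled` is the leader's
    throttled-replicas set; `quota_allowance` is the remaining byte budget
    for throttled traffic in this response."""
    response = {}
    remaining = response_limit
    # First pass: unthrottled partitions, in request order, up to the
    # response size limit.
    for p in partitions:
        if p in throttled:
            continue
        take = min(available_bytes[p], remaining)
        response[p] = take
        remaining -= take
    # Second pass: throttled partitions share the leftover space, bounded
    # additionally by the quota allowance; exhausted ones return 0 bytes.
    for p in partitions:
        if p not in throttled:
            continue
        take = min(available_bytes[p], remaining, quota_allowance)
        response[p] = take
        remaining -= take
        quota_allowance -= take
    return response
```

With no room or allowance left, a throttled partition simply comes back with 0 bytes, matching the "omit from the response" behaviour described above.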
So I guess the question is whether this alternative is simple enough, and equally functional, to not go with dedicated throttled replica fetchers.

On Tue, Aug 9, 2016 at 9:44 AM, Jun Rao <j...@confluent.io> wrote:

Just to elaborate on what Ben said about why we need throttling on both the leader and the follower side.

If we only have throttling on the follower side, consider a case where we add 5 more new brokers and want to move some replicas from existing brokers over to those 5 brokers. Each of those brokers is going to fetch data from all existing brokers. Then, it's possible that the aggregated fetch load from those 5 brokers on a particular existing broker exceeds its outgoing network bandwidth, even though the inbound traffic on each of those 5 brokers is bounded.
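To make the arithmetic of this example concrete (with assumed numbers, since the thread names none): with follower-only quotas, each new broker's intake is capped, but one existing leader can still end up serving pieces of all five streams at once.

```python
# Assumed figures for illustration only: 5 new brokers, each
# follower-throttled to 50 MB/s inbound.
new_brokers = 5
follower_quota_mb_s = 50

# Worst case for a single existing broker: it happens to lead every
# partition the 5 new brokers are currently fetching, so its egress
# approaches the sum of all the follower quotas.
worst_case_leader_egress = new_brokers * follower_quota_mb_s
print(worst_case_leader_egress)  # 250 MB/s from one existing broker
```

A leader-side quota bounds that egress directly, which is why the proposal applies the throttle on both sides.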
If we only have throttling on the leader side, consider the same example above. It's possible for the incoming traffic to each of those 5 brokers to exceed its network bandwidth, since it is fetching data from all existing brokers.

So, being able to set a quota on both the follower and the leader side protects both cases.

Thanks,

Jun

On Tue, Aug 9, 2016 at 4:43 AM, Ben Stopford <b...@confluent.io> wrote:

Hi Joel

Thanks for taking the time to look at this. Appreciated.

Regarding throttling on both leader and follower: this proposal covers a more general solution, which can guarantee a quota even when a rebalance operation produces an asymmetric profile of load.
This means administrators don't need to calculate the impact that a follower-only quota will have on the leaders they are fetching from - for example, where replica sizes are skewed or where a partial rebalance is required.

Having said that, even with both leader and follower quotas, the use of additional threads is actually optional. There appear to be two general approaches: (1) omit partitions from fetch requests (follower) / fetch responses (leader) when they exceed their quota; (2) delay them, as the existing quota mechanism does, using separate fetchers. Both appear valid, but with slightly different design tradeoffs.

The issue with approach (1) is that it departs somewhat from the existing quotas implementation, and must include a notion of fairness within the now size-bounded request and response.
The issue with (2) is guaranteeing ordering of updates when replicas shift threads, but this is handled, for the most part, in the code today.

I've updated the rejected alternatives section to make this a little clearer.

B

On 8 Aug 2016, at 20:38, Joel Koshy <jjkosh...@gmail.com> wrote:

Hi Ben,

Thanks for the detailed write-up. So the proposal involves self-throttling on the fetcher side and throttling at the leader. Can you elaborate on the reasoning that is given on the wiki: *"The throttle is applied to both leaders and followers. This allows the admin to exert strong guarantees on the throttle limit."* Is there any reason why one or the other wouldn't be sufficient?
Specifically, if we were to only do self-throttling on the fetchers, we could potentially avoid the additional replica fetchers, right? I.e., the replica fetchers would maintain their quota metrics as you proposed, and each (normal) replica fetch presents an opportunity to make progress for the throttled partitions, as long as their effective consumption rate is below the quota limit. If it exceeds the consumption rate, then don't include the throttled partitions in the subsequent fetch requests until the effective consumption rate for those partitions returns to within the quota threshold.

I have more questions on the proposal, but was more interested in the above to see if it could simplify things a bit.

Also, can you open up access to the google-doc that you link to?
Thanks,

Joel

On Mon, Aug 8, 2016 at 5:54 AM, Ben Stopford <b...@confluent.io> wrote:

We've created KIP-73: Replication Quotas

The idea is to allow an admin to throttle moving replicas. Full details are here:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-73+Replication+Quotas

Please take a look and let us know your thoughts.

Thanks

B

--
-Regards,
Mayuresh R. Gharat
(862) 250-7125

--
Ben Stopford

--
Gwen Shapira
Product Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter | blog

--
*Todd Palino*
Staff Site Reliability Engineer
Data Infrastructure Streaming

linkedin.com/in/toddpalino