Joel,

Yes, for your second comment: the tricky thing is still to figure out which replicas to throttle and by how much, since in general admins probably don't want replicas that are already in sync, or close to in sync, to be throttled. It would be great to get Todd's opinion on this. Could you ping him?
Yes, we'd be happy to discuss auto-detection of effect traffic more offline.

Thanks,

Jun

On Thu, Aug 18, 2016 at 10:21 AM, Joel Koshy <jjkosh...@gmail.com> wrote:

> > For your first comment. We thought about determining "effect" replicas automatically as well. First, there is some tricky stuff that one has to
>
> Auto-detection of effect traffic: I'm fairly certain it's doable, but definitely tricky. I'm also not sure it is something worth tackling at the outset. If we want to spend more time thinking it over, even if it's just an academic exercise, I would be happy to brainstorm offline.
>
> > For your second comment, we discussed that in the client quotas design. A downside of that for client quotas is that a client may be surprised that its traffic is not throttled at one time, but throttled at another with the same quota (basically, less predictability). You can imagine setting a quota for all replication traffic and only slow down the "effect" replicas if needed. The thought is more or less the same as the above. It requires more
>
> For clients, this is true. I think this is much less of an issue for server-side replication, since the "users" here are the Kafka SREs, who generally know these internal details.
>
> I think it would be valuable to get some feedback from SREs on the proposal before proceeding to a vote. (ping Todd)
>
> Joel
>
> > On Thu, Aug 18, 2016 at 9:37 AM, Ben Stopford <b...@confluent.io> wrote:
> > >
> > > Hi Joel
> > >
> > > Ha! Yes, we had some similar thoughts, on both counts. Both are actually good approaches, but come with some extra complexity.
> > >
> > > Segregating the replication type is tempting, as it creates a more general solution. One issue is that you need to draw a line between lagging and not lagging. The ISR ‘limit' is a tempting divider, but it has the side effect that, once you drop out, you get throttled immediately. Adding a configurable divider is another option, but it is difficult for admins to set, and always a little arbitrary. A better idea is to prioritise in reverse order to lag, but that also comes with additional complexity of its own.
> > >
> > > Under-throttling is also a tempting addition. That’s to say, if there’s idle bandwidth lying around, not being used, why not use it to let lagging brokers catch up? This involves some comparison to the maximum bandwidth, which could be configurable or could be derived, with pros and cons for each.
> > >
> > > But the more general problem is actually quite hard to reason about, so after some discussion we decided to settle on something simple that we felt we could get working, and extend to add these additional features as subsequent KIPs.
> > >
> > > I hope that seems reasonable. Jun may wish to add to this.
> > >
> > > B
> > >
> > > > On 18 Aug 2016, at 06:56, Joel Koshy <jjkosh...@gmail.com> wrote:
> > > >
> > > > On Wed, Aug 17, 2016 at 9:13 PM, Ben Stopford <b...@confluent.io> wrote:
> > > >
> > > >> Let us know if you have any further thoughts on KIP-73, else we'll kick off a vote.
> > > >
> > > > I think the mechanism for throttling replicas looks good. Just had a few more thoughts on the configuration section. What you have looks reasonable, but I was wondering if it could be made simpler. You probably thought through these, so I'm curious to know your take.
> > > > My guess is that most of the time, users would want to throttle all effect replication - due to partition reassignments, adding brokers, or a broker coming back online after an extended period of time. In all these scenarios it may be possible to distinguish bootstrap (effect) vs normal replication - based on how far the replica has to catch up. I'm wondering if it is enough to just set an umbrella "effect" replication quota, with perhaps per-topic overrides (say, if some topics are more important than others), as opposed to designating throttled replicas.
> > > >
> > > > Also, IIRC during the client-side quota discussions we had considered the possibility of allowing clients to go above their quotas when resources are available. We ended up not doing that, but for replication throttling it may make sense - i.e., to treat the quota as a soft limit. Another way to look at it: instead of ensuring "effect replication traffic does not flow faster than X bytes/sec", it may be useful to instead ensure that "effect replication traffic only flows as slowly as necessary (so as not to adversely affect normal replication traffic)."
> > > >
> > > > Thanks,
> > > >
> > > > Joel
> > > >
> > > >>>> On Thu, Aug 11, 2016 at 2:43 PM, Jun Rao <j...@confluent.io> wrote:
> > > >>>>
> > > >>>>> Hi, Joel,
> > > >>>>>
> > > >>>>> Yes, the response size includes both throttled and unthrottled replicas. However, the response is only delayed up to max.wait if the response size is less than min.bytes, which matches the current behavior. So, there is no extra delay due to throttling, right? For replica fetchers, the default min.bytes is 1. So, the response is only delayed if there are no bytes in the response, which is what we want.
> > > >>>>>
> > > >>>>> Thanks,
> > > >>>>>
> > > >>>>> Jun
> > > >>>>>
> > > >>>>> On Thu, Aug 11, 2016 at 11:53 AM, Joel Koshy <jjkosh...@gmail.com> wrote:
> > > >>>>>
> > > >>>>>> Hi Jun,
> > > >>>>>>
> > > >>>>>> I'm not sure that would work unless we have separate replica fetchers, since this would cause all replicas (including ones that are not throttled) to get delayed. Instead, we could just have the leader populate the throttle-time field of the response as a hint to the follower as to how long it should wait before it adds those replicas back to its subsequent replica fetch requests.
> > > >>>>>>
> > > >>>>>> Thanks,
> > > >>>>>>
> > > >>>>>> Joel
> > > >>>>>>
> > > >>>>>> On Thu, Aug 11, 2016 at 9:50 AM, Jun Rao <j...@confluent.io> wrote:
> > > >>>>>>
> > > >>>>>>> Mayuresh,
> > > >>>>>>>
> > > >>>>>>> That's a good question. I think if the response size (after leader throttling) is smaller than min.bytes, we will just delay the sending of the response up to max.wait as we do now. This should prevent frequent empty responses to the follower.
> > > >>>>>>>
> > > >>>>>>> Thanks,
> > > >>>>>>>
> > > >>>>>>> Jun
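A minimal sketch of the min.bytes/max.wait decision Jun describes above, using hypothetical names rather than the actual broker code:

    // Decide whether a fetch response is sent immediately or parked until
    // max.wait expires. The size is computed after leader-side throttling
    // has already omitted any over-quota throttled partitions.
    case class FetchParams(minBytes: Int, maxWaitMs: Long)

    def shouldRespondImmediately(responseBytes: Int, params: FetchParams): Boolean =
      // Matches the existing behaviour: delay only while below min.bytes.
      // Replica fetchers default to min.bytes = 1, so only a completely
      // empty response is held back, up to max.wait.
      responseBytes >= params.minBytes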
> > > >>>>>>> On Wed, Aug 10, 2016 at 9:17 PM, Mayuresh Gharat <gharatmayures...@gmail.com> wrote:
> > > >>>>>>>
> > > >>>>>>>> This might have been answered before. I was wondering: when the leader quota is reached and it sends an empty response ("If the inclusion of a partition, listed in the leader's throttled-replicas list, causes the LeaderQuotaRate to be exceeded, that partition is omitted from the response (aka returns 0 bytes)."), at that point the follower quota is NOT reached and the follower is still going to ask for that partition in the next fetch request. Would it be fair to add some logic there so that the follower backs off (for some configurable time) from including those partitions in the next fetch request?
> > > >>>>>>>>
> > > >>>>>>>> Thanks,
> > > >>>>>>>>
> > > >>>>>>>> Mayuresh
> > > >>>>>>>>
> > > >>>>>>>> On Wed, Aug 10, 2016 at 8:06 AM, Ben Stopford <b...@confluent.io> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> Thanks again for the responses everyone. I’ve removed the extra fetcher threads from the proposal, switching to the inclusion-based approach. The relevant section is:
> > > >>>>>>>>>
> > > >>>>>>>>> The follower makes a request, using the fixed size of replica.fetch.response.max.bytes as per KIP-74 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-74%3A+Add+Fetch+Response+Size+Limit+in+Bytes>. The order of the partitions in the fetch request is randomised to ensure fairness.
> > > >>>>>>>>>
> > > >>>>>>>>> When the leader receives the fetch request, it processes the partitions in the defined order, up to the response's size limit. If the inclusion of a partition, listed in the leader's throttled-replicas list, causes the LeaderQuotaRate to be exceeded, that partition is omitted from the response (aka returns 0 bytes). Logically, this is of the form:
> > > >>>>>>>>>
> > > >>>>>>>>> var bytesAllowedForThrottledPartition = quota.recordAndMaybeAdjust(bytesRequestedForPartition)
> > > >>>>>>>>>
> > > >>>>>>>>> When the follower receives the fetch response, if it includes partitions in its throttled-partitions list, it increments the FollowerQuotaRate:
> > > >>>>>>>>>
> > > >>>>>>>>> var includeThrottledPartitionsInNextRequest: Boolean = quota.recordAndEvaluate(previousResponseThrottledBytes)
> > > >>>>>>>>>
> > > >>>>>>>>> If the quota is exceeded, no throttled partitions will be included in the next fetch request emitted by this replica fetcher thread.
> > > >>>>>>>>>
> > > >>>>>>>>> B
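Expanding the quoted pseudocode into a fuller sketch of the leader-side loop; the quota trait and all names below are illustrative, not the actual KIP-73 implementation:

    // Sketch: assemble a fetch response, omitting throttled partitions
    // (returning 0 bytes for them) once the leader quota is exceeded.
    trait ReplicationQuota {
      def isThrottled(tp: String): Boolean // in the throttled-replicas list?
      def isQuotaExceeded: Boolean
      def record(bytes: Long): Unit
    }

    def buildResponse(partitions: Seq[String],
                      bytesAvailable: String => Long,
                      responseLimit: Long,
                      quota: ReplicationQuota): Map[String, Long] = {
      var remaining = responseLimit
      val response = scala.collection.mutable.LinkedHashMap[String, Long]()
      for (tp <- partitions) { // order already randomised by the follower
        val bytes = math.min(bytesAvailable(tp), remaining)
        if (quota.isThrottled(tp) && quota.isQuotaExceeded) {
          response += tp -> 0L // omitted: leader quota exceeded
        } else {
          if (quota.isThrottled(tp)) quota.record(bytes)
          response += tp -> bytes
          remaining -= bytes
        }
      }
      response.toMap
    }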
> > > >>>>>>>>>> On 9 Aug 2016, at 23:34, Jun Rao <j...@confluent.io> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> When there are several unthrottled replicas, we could also just do what's suggested in KIP-74. The client is responsible for reordering the partitions, and the leader fills in the bytes to those partitions in order, up to the quota limit.
> > > >>>>>>>>>>
> > > >>>>>>>>>> We could also do what you suggested. If the quota is exceeded, include empty data in the response for throttled replicas. Keep doing that until enough time has passed so that the quota is no longer exceeded. This potentially allows better batching per partition. Not sure if the two make a big difference in practice though.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Thanks,
> > > >>>>>>>>>>
> > > >>>>>>>>>> Jun
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Tue, Aug 9, 2016 at 2:31 PM, Joel Koshy <jjkosh...@gmail.com> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>>>> On the leader side, one challenge is related to the fairness issue that Ben brought up. The question is: what if the fetch response limit is filled up by the throttled replicas? If this happens constantly, we will delay the progress of those un-throttled replicas. However, I think we can address this issue by trying to fill up the unthrottled replicas in the response first. So, the algorithm would be: fill up unthrottled replicas up to the fetch response limit. If there is space left, fill up throttled replicas. If the quota is exceeded for the throttled replicas, reduce the bytes in the throttled replicas in the response accordingly.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Right - that's what I was trying to convey by truncation (vs empty). So we would attempt to fill the response for throttled partitions as much as we can before hitting the quota limit.
> > > >>>>>>>>>>> There is one more detail to handle in this: if there are several throttled partitions, and not enough remaining allowance in the fetch response to include all the throttled replicas, then we would need to decide which of those partitions get a share; which is why I'm wondering if it is easier to return empty for those partitions entirely in the fetch response - they will make progress in the subsequent fetch. If they don't make fast enough progress, then that would be a case for raising the threshold or letting it complete at an off-peak time.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> With this approach, we need some new logic to handle throttling on the leader, but we can leave the replica threading model unchanged. So, overall, this still seems to be a simpler approach.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Jun
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> On Tue, Aug 9, 2016 at 11:57 AM, Mayuresh Gharat <gharatmayures...@gmail.com> wrote:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> Nice write-up Ben.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> I agree with Joel on keeping this simple by excluding the partitions from the fetch request/response when the quota is violated at the follower or leader, instead of having a separate set of threads for handling the quota and non-quota cases. Even though it's different from the current quota implementation, it should be OK, since it's internal to brokers and can be handled by the admins tuning the quota configs appropriately.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Also, can you elaborate with an example how this would be handled: *guaranteeing ordering of updates when replicas shift threads*
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Mayuresh
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> On Tue, Aug 9, 2016 at 10:49 AM, Joel Koshy <jjkosh...@gmail.com> wrote:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> On the need for both leader/follower throttling: that makes sense - thanks for clarifying. For completeness, can we add this detail to the doc - say, after the quote that I pasted earlier?
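A sketch of the fill-unthrottled-first ordering quoted above, combined with Joel's empty-rather-than-truncate suggestion; the names are illustrative only:

    // Unthrottled partitions consume the response budget first; throttled
    // partitions share only the remaining headroom. When the quota has been
    // exceeded, throttled partitions are returned empty (0 bytes) and simply
    // make progress in a later fetch.
    def fillOrder(partitions: Seq[String],
                  isThrottled: String => Boolean): Seq[String] = {
      val (throttled, unthrottled) = partitions.partition(isThrottled)
      unthrottled ++ throttled
    }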
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> From an implementation perspective though: I’m still interested in the simplicity of not having to add separate replica fetchers, a delay queue on the leader, and the need to “move” partitions from the throttled replica fetchers to the regular replica fetchers once caught up.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Instead, I think it would work, and be simpler, to include or exclude the partitions in the fetch request from the follower and the fetch response from the leader when the quota is violated. The issue of fairness that Ben noted may be a wash between the two options (that Ben wrote in his email). With the default quota delay mechanism, partitions get delayed essentially at random - i.e., whoever fetches at the time of quota violation gets delayed at the leader. So we can adopt a similar policy in choosing to truncate partitions in fetch responses. I.e., if at the time of handling the fetch the “effect” replication rate exceeds the quota, then either empty or truncate those partitions from the response. (BTW, effect replication is your terminology in the wiki - i.e., replication due to partition reassignment, adding brokers, etc.)
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> While this may be slightly different from the existing quota mechanism, I think the difference is small (since we would reuse the quota manager, at worst with some refactoring) and will be internal to the broker.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> So I guess the question is whether this alternative is simple enough, and equally functional, to not go with dedicated throttled replica fetchers.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> On Tue, Aug 9, 2016 at 9:44 AM, Jun Rao <j...@confluent.io> wrote:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Just to elaborate on what Ben said about why we need throttling on both the leader and the follower side.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> If we only have throttling on the follower side, consider a case where we add 5 new brokers and want to move some replicas from existing brokers over to those 5 brokers.
> > > >>>>>>>>>>>>>>> Each of those brokers is going to fetch data from all existing brokers. Then it's possible that the aggregated fetch load from those 5 brokers on a particular existing broker exceeds its outgoing network bandwidth, even though the inbound traffic on each of those 5 brokers is bounded.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> If we only have throttling on the leader side, consider the same example above. It's possible for the incoming traffic to each of those 5 brokers to exceed its network bandwidth, since each is fetching data from all existing brokers.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> So, being able to set a quota on both the follower and the leader side protects against both cases.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Jun
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> On Tue, Aug 9, 2016 at 4:43 AM, Ben Stopford <b...@confluent.io> wrote:
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Hi Joel
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Thanks for taking the time to look at this. Appreciated.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Regarding throttling on both leader and follower: this proposal covers a more general solution, which can guarantee a quota even when a rebalance operation produces an asymmetric profile of load. This means administrators don’t need to calculate the impact that a follower-only quota will have on the leaders they are fetching from - so, for example, where replica sizes are skewed or where a partial rebalance is required.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Having said that, even with both leader and follower quotas, the use of additional threads is actually optional. There appear to be two general approaches: (1) omit partitions from fetch requests (follower) / fetch responses (leader) when they exceed their quota; (2) delay them, as the existing quota mechanism does, using separate fetchers. Both appear valid, but with slightly different design tradeoffs.
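To put rough numbers on Jun's 5-broker example above (all figures assumed, purely for illustration):

    follower-side quota only:
      inbound cap per new broker:           20 MB/s (bounded)
      worst-case outbound on one leader:    5 x 20 = 100 MB/s (unbounded)

    leader-side quota only:
      outbound cap per existing leader:     20 MB/s (bounded)
      worst-case inbound on one new broker,
      fetching from, say, 10 leaders:       10 x 20 = 200 MB/s (unbounded)

Each side's quota alone bounds only one direction of the transfer, which is why the proposal throttles both.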
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> The issue with approach (1) is that it departs somewhat from the existing quotas implementation, and must include a notion of fairness within the now size-bounded request and response. The issue with (2) is guaranteeing ordering of updates when replicas shift threads, but this is handled, for the most part, in the code today.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> I’ve updated the rejected alternatives section to make this a little clearer.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> B
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> On 8 Aug 2016, at 20:38, Joel Koshy <jjkosh...@gmail.com> wrote:
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Hi Ben,
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Thanks for the detailed write-up. So the proposal involves self-throttling on the fetcher side and throttling at the leader. Can you elaborate on the reasoning that is given on the wiki: *“The throttle is applied to both leaders and followers. This allows the admin to exert strong guarantees on the throttle limit.”* Is there any reason why one or the other wouldn't be sufficient?
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Specifically, if we were to only do self-throttling on the fetchers, we could potentially avoid the additional replica fetchers, right? I.e., the replica fetchers would maintain their quota metrics as you proposed, and each (normal) replica fetch presents an opportunity to make progress for the throttled partitions, as long as their effective consumption rate is below the quota limit. If it exceeds the quota, then don’t include the throttled partitions in the subsequent fetch requests until the effective consumption rate for those partitions returns to within the quota threshold.
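A sketch of the follower-side gate Joel describes here; the class and its simple windowed-rate bookkeeping are hypothetical, for illustration only:

    // After each fetch, record the bytes received for throttled partitions;
    // include them in the next fetch request only while the measured rate
    // stays under the quota.
    class FollowerThrottle(quotaBytesPerSec: Double) {
      private val windowStartMs = System.currentTimeMillis()
      private var windowBytes = 0L

      def record(throttledBytes: Long): Unit = windowBytes += throttledBytes

      def includeThrottledPartitions(nowMs: Long): Boolean = {
        val elapsedSec = math.max((nowMs - windowStartMs) / 1000.0, 0.001)
        windowBytes / elapsedSec < quotaBytesPerSec
      }
    }

Each replica fetcher thread would consult includeThrottledPartitions when assembling its next fetch request, matching the includeThrottledPartitionsInNextRequest flag in Ben's updated proposal above.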
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> I have more questions on the proposal, but was more interested in the above to see if it could simplify things a bit.
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Also, can you open up access to the google-doc that you link to?
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Joel
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> On Mon, Aug 8, 2016 at 5:54 AM, Ben Stopford <b...@confluent.io> wrote:
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> We’ve created KIP-73: Replication Quotas
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> The idea is to allow an admin to throttle moving replicas. Full details are here:
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-73+Replication+Quotas
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Please take a look and let us know your thoughts.
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Thanks
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> B