Hi everyone. Let us know if you have any further thoughts on KIP-73; otherwise we'll kick off a vote.
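For anyone catching up on the thread, the follower-side check we've been discussing reduces to something like the sketch below (Python with illustrative names; the real patch is structured differently, this is only meant to show the shape of the bookkeeping):

```python
# Hypothetical sketch of the follower-side gate (names are illustrative,
# not taken from the actual patch). A windowed byte count decides whether
# throttled partitions are included in the next fetch request.
class SimpleRateQuota:
    def __init__(self, bytes_per_sec, window_sec):
        self.bytes_per_sec = bytes_per_sec
        self.window_sec = window_sec
        self.window_start_ms = 0
        self.bytes_in_window = 0

    def record_and_evaluate(self, bytes_read, now_ms):
        """Record bytes fetched for throttled partitions; return True if the
        observed rate is still within quota, i.e. throttled partitions may
        be included in the next fetch request."""
        if now_ms - self.window_start_ms >= self.window_sec * 1000:
            self.window_start_ms = now_ms
            self.bytes_in_window = 0
        self.bytes_in_window += bytes_read
        return self.bytes_in_window <= self.bytes_per_sec * self.window_sec
```

The follower would call something like `quota.record_and_evaluate(previous_response_throttled_bytes, now)` after each response, matching the `recordAndEvaluate` pseudocode in the KIP.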
Thanks

B

On Friday, 12 August 2016, Jun Rao <j...@confluent.io> wrote:

> Mayuresh,
>
> I was thinking of the following.
>
> If P1 has data and P2 is throttled, we will return empty data for P2 and send the response back immediately. The follower will issue the next fetch request immediately, but the leader won't return any data in P2 until the quota is no longer exceeded. We are not delaying the fetch requests here. However, there is no additional overhead compared with no throttling, since P1 always has data.
>
> If P1 has no data and P2 is throttled, the leader will return empty data for both P1 and P2 after waiting in the Purgatory up to max.wait. This prevents the follower from getting empty responses too frequently.
>
> Thanks,
>
> Jun
>
> On Thu, Aug 11, 2016 at 5:33 PM, Mayuresh Gharat <gharatmayures...@gmail.com> wrote:
>
> > Hi Jun,
> >
> > Correct me if I am wrong. If the response size includes throttled and unthrottled replicas, I am wondering if this is possible: the leader broker B1 receives a fetch request for partitions P1 and P2 of a topic from replica broker B2. Let's say that only P2 is throttled on the leader and P1 is not. In that case we will add the data for P1 to the response, in which case the min.bytes threshold will be crossed and the response will be returned right away, right?
> > If we say that with this KIP we will throttle this fetch request entirely, then we are essentially delaying the response for partition P1, which is not the throttled partition.
> >
> > Is it fair to say we can indicate to the follower in the fetch response how much time it should wait before it adds back a fetch request for partition P2?
> >
> > Thanks,
> >
> > Mayuresh
> >
> > On Thu, Aug 11, 2016 at 2:43 PM, Jun Rao <j...@confluent.io> wrote:
> >
> > > Hi, Joel,
> > >
> > > Yes, the response size includes both throttled and unthrottled replicas.
> > > However, the response is only delayed up to max.wait if the response size is less than min.bytes, which matches the current behavior. So, there is no extra delay due to throttling, right? For replica fetchers, the default min.bytes is 1. So, the response is only delayed if there are no bytes in the response, which is what we want.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Thu, Aug 11, 2016 at 11:53 AM, Joel Koshy <jjkosh...@gmail.com> wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > I'm not sure that would work unless we have separate replica fetchers, since this would cause all replicas (including ones that are not throttled) to get delayed. Instead, we could just have the leader populate the throttle-time field of the response as a hint to the follower as to how long it should wait before it adds those replicas back to its subsequent replica fetch requests.
> > > >
> > > > Thanks,
> > > >
> > > > Joel
> > > >
> > > > On Thu, Aug 11, 2016 at 9:50 AM, Jun Rao <j...@confluent.io> wrote:
> > > >
> > > > > Mayuresh,
> > > > >
> > > > > That's a good question. I think if the response size (after leader throttling) is smaller than min.bytes, we will just delay the sending of the response up to max.wait as we do now. This should prevent frequent empty responses to the follower.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Wed, Aug 10, 2016 at 9:17 PM, Mayuresh Gharat <gharatmayures...@gmail.com> wrote:
> > > > >
> > > > > > This might have been answered before.
> > > > > > I was wondering about when the leader quota is reached and it sends an empty response ("If the inclusion of a partition, listed in the leader's throttled-replicas list, causes the LeaderQuotaRate to be exceeded, that partition is omitted from the response (aka returns 0 bytes)."). At this point the follower quota is NOT reached and the follower is still going to ask for that partition in the next fetch request. Would it be fair to add some logic there so that the follower backs off (for some configurable time) from including those partitions in the next fetch request?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Mayuresh
> > > > > >
> > > > > > On Wed, Aug 10, 2016 at 8:06 AM, Ben Stopford <b...@confluent.io> wrote:
> > > > > >
> > > > > > > Thanks again for the responses everyone. I’ve removed the extra fetcher threads from the proposal, switching to the inclusion-based approach. The relevant section is:
> > > > > > >
> > > > > > > The follower makes a request, using the fixed size of replica.fetch.response.max.bytes as per KIP-74 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-74%3A+Add+Fetch+Response+Size+Limit+in+Bytes>. The order of the partitions in the fetch request is randomised to ensure fairness. When the leader receives the fetch request it processes the partitions in the defined order, up to the response's size limit.
> > > > > > > If the inclusion of a partition, listed in the leader's throttled-replicas list, causes the LeaderQuotaRate to be exceeded, that partition is omitted from the response (aka returns 0 bytes). Logically, this is of the form:
> > > > > > >
> > > > > > > var bytesAllowedForThrottledPartition = quota.recordAndMaybeAdjust(bytesRequestedForPartition)
> > > > > > >
> > > > > > > When the follower receives the fetch response, if it includes partitions in its throttled-partitions list, it increments the FollowerQuotaRate:
> > > > > > >
> > > > > > > var includeThrottledPartitionsInNextRequest: Boolean = quota.recordAndEvaluate(previousResponseThrottledBytes)
> > > > > > >
> > > > > > > If the quota is exceeded, no throttled partitions will be included in the next fetch request emitted by this replica fetcher thread.
> > > > > > >
> > > > > > > B
> > > > > > >
> > > > > > > > On 9 Aug 2016, at 23:34, Jun Rao <j...@confluent.io> wrote:
> > > > > > > >
> > > > > > > > When there are several unthrottled replicas, we could also just do what's suggested in KIP-74. The client is responsible for reordering the partitions and the leader fills in the bytes to those partitions in order, up to the quota limit.
> > > > > > > >
> > > > > > > > We could also do what you suggested. If the quota is exceeded, include empty data in the response for throttled replicas. Keep doing that until enough time has passed so that the quota is no longer exceeded. This potentially allows better batching per partition. Not sure if the two make a big difference in practice though.
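The leader-side bookkeeping described here - trim or omit a throttled partition once the quota window's allowance is used up - might look roughly like the following. This is an illustrative sketch only; `LeaderQuota` and its method names are stand-ins, not the actual Kafka implementation:

```python
# Illustrative only: the leader records the bytes it is about to send for a
# throttled partition and trims them to whatever the quota window still
# allows (possibly 0, i.e. the partition is omitted from this response).
class LeaderQuota:
    def __init__(self, allowed_bytes_per_window):
        self.allowed_bytes_per_window = allowed_bytes_per_window
        self.used_in_window = 0

    def record_and_maybe_adjust(self, bytes_requested):
        """Return how many of the requested bytes may actually be sent."""
        remaining = max(0, self.allowed_bytes_per_window - self.used_in_window)
        granted = min(bytes_requested, remaining)
        self.used_in_window += granted
        return granted

    def reset_window(self):
        """Called when the quota window rolls over."""
        self.used_in_window = 0
```

A return value of 0 corresponds to the "empty data for throttled replicas" case: the partition stays in the response but carries no bytes until the window rolls over.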
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Jun
> > > > > > > >
> > > > > > > > On Tue, Aug 9, 2016 at 2:31 PM, Joel Koshy <jjkosh...@gmail.com> wrote:
> > > > > > > >
> > > > > > > >>> On the leader side, one challenge is related to the fairness issue that Ben brought up. The question is what if the fetch response limit is filled up by the throttled replicas? If this happens constantly, we will delay the progress of those un-throttled replicas. However, I think we can address this issue by trying to fill up the unthrottled replicas in the response first. So, the algorithm would be: fill up unthrottled replicas up to the fetch response limit. If there is space left, fill up throttled replicas. If the quota is exceeded for the throttled replicas, reduce the bytes in the throttled replicas in the response accordingly.
> > > > > > > >>
> > > > > > > >> Right - that's what I was trying to convey by truncation (vs empty). So we would attempt to fill the response for throttled partitions as much as we can before hitting the quota limit.
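The fill order Jun describes - unthrottled replicas first, throttled replicas into whatever space and quota allowance remain - can be sketched as below. The function and tuple layout are illustrative, not the real request structures:

```python
# Sketch of the fill order described above: satisfy unthrottled partitions
# first, then give throttled partitions whatever space (and quota allowance)
# remains. Partition tuples stand in for the real fetch request structures.
def fill_response(partitions, response_limit, throttle_allowance):
    """partitions: list of (name, bytes_available, is_throttled) tuples.
    Returns a map of partition name -> bytes included in the response."""
    remaining = response_limit
    allowance = throttle_allowance
    result = {}
    # Pass 1: unthrottled partitions, up to the response size limit.
    for name, available, throttled in partitions:
        if not throttled:
            take = min(available, remaining)
            result[name] = take
            remaining -= take
    # Pass 2: throttled partitions, bounded by both the leftover space
    # in the response and the remaining quota allowance.
    for name, available, throttled in partitions:
        if throttled:
            take = min(available, remaining, allowance)
            result[name] = take
            remaining -= take
            allowance -= take
    return result
```

With a 100-byte response limit and a 25-byte allowance, a throttled partition gets only what the unthrottled partitions leave behind, which is the fairness property being argued for.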
> > > > > > > >> There is one more detail to handle in this: if there are several throttled partitions and not enough remaining allowance in the fetch response to include all the throttled replicas, then we would need to decide which of those partitions get a share; which is why I'm wondering if it is easier to return empty for those partitions entirely in the fetch response - they will make progress in the subsequent fetch. If they don't make fast enough progress then that would be a case for raising the threshold or letting it complete at an off-peak time.
> > > > > > > >>
> > > > > > > >>> With this approach, we need some new logic to handle throttling on the leader, but we can leave the replica threading model unchanged. So, overall, this still seems to be a simpler approach.
> > > > > > > >>>
> > > > > > > >>> Thanks,
> > > > > > > >>>
> > > > > > > >>> Jun
> > > > > > > >>>
> > > > > > > >>> On Tue, Aug 9, 2016 at 11:57 AM, Mayuresh Gharat <gharatmayures...@gmail.com> wrote:
> > > > > > > >>>
> > > > > > > >>>> Nice write-up Ben.
> > > > > > > >>>>
> > > > > > > >>>> I agree with Joel on keeping this simple by excluding the partitions from the fetch request/response when the quota is violated at the follower or leader, instead of having a separate set of threads for handling the quota and non-quota cases.
> > > > > > > >>>> Even though it's different from the current quota implementation, it should be OK since it's internal to the brokers and can be handled by the admins tuning the quota configs for it appropriately.
> > > > > > > >>>>
> > > > > > > >>>> Also, can you elaborate with an example on how this would be handled: *guaranteeing ordering of updates when replicas shift threads*
> > > > > > > >>>>
> > > > > > > >>>> Thanks,
> > > > > > > >>>>
> > > > > > > >>>> Mayuresh
> > > > > > > >>>>
> > > > > > > >>>> On Tue, Aug 9, 2016 at 10:49 AM, Joel Koshy <jjkosh...@gmail.com> wrote:
> > > > > > > >>>>
> > > > > > > >>>>> On the need for both leader/follower throttling: that makes sense - thanks for clarifying. For completeness, can we add this detail to the doc - say, after the quote that I pasted earlier?
> > > > > > > >>>>>
> > > > > > > >>>>> From an implementation perspective though: I’m still interested in the simplicity of not having to add separate replica fetchers, a delay queue on the leader, and “move” partitions from the throttled replica fetchers to the regular replica fetchers once caught up.
> > > > > > > >>>>>
> > > > > > > >>>>> Instead, I think it would work and be simpler to include or exclude the partitions in the fetch request from the follower and fetch response from the leader when the quota is violated.
> > > > > > > >>>>> The issue of fairness that Ben noted may be a wash between the two options (that Ben wrote in his email). With the default quota delay mechanism, partitions get delayed essentially at random - i.e., whoever fetches at the time of quota violation gets delayed at the leader. So we can adopt a similar policy in choosing to truncate partitions in fetch responses - i.e., if at the time of handling the fetch the “effect” replication rate exceeds the quota, then either empty or truncate those partitions from the response. (BTW, “effect replication” is your terminology in the wiki - i.e., replication due to partition reassignment, adding brokers, etc.)
> > > > > > > >>>>>
> > > > > > > >>>>> While this may be slightly different from the existing quota mechanism, I think the difference is small (since we would reuse the quota manager, at worst with some refactoring) and will be internal to the broker.
> > > > > > > >>>>>
> > > > > > > >>>>> So I guess the question is whether this alternative is simple enough and equally functional to not go with dedicated throttled replica fetchers.
> > > > > > > >>>>> On Tue, Aug 9, 2016 at 9:44 AM, Jun Rao <j...@confluent.io> wrote:
> > > > > > > >>>>>
> > > > > > > >>>>>> Just to elaborate on what Ben said about why we need throttling on both the leader and the follower side.
> > > > > > > >>>>>>
> > > > > > > >>>>>> If we only have throttling on the follower side, consider a case where we add 5 new brokers and want to move some replicas from existing brokers over to those 5 brokers. Each of those brokers is going to fetch data from all existing brokers. Then, it's possible that the aggregated fetch load from those 5 brokers on a particular existing broker exceeds its outgoing network bandwidth, even though the inbound traffic on each of those 5 brokers is bounded.
> > > > > > > >>>>>>
> > > > > > > >>>>>> If we only have throttling on the leader side, consider the same example above. It's possible for the incoming traffic to each of those 5 brokers to exceed its network bandwidth, since each is fetching data from all existing brokers.
> > > > > > > >>>>>>
> > > > > > > >>>>>> So, being able to set a quota on both the follower and the leader side protects against both cases.
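Jun's first scenario can be made concrete with made-up numbers (the figures below are hypothetical, purely to show the arithmetic):

```python
# Back-of-envelope version of the example above (numbers are made up): five
# new brokers each fetch within their own inbound quota, yet a single
# existing leader can still be asked to serve the sum of those rates on its
# outbound link if only follower-side throttling is in place.
new_brokers = 5
inbound_quota_mbps = 40  # hypothetical per-broker inbound quota, MB/s
aggregate_outbound_mbps = new_brokers * inbound_quota_mbps
# One leader fetched by all five followers may see 5 * 40 = 200 MB/s out,
# even though every follower individually stays within its quota.
```

The leader-side quota caps exactly this aggregate, which is why the proposal throttles on both sides.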
> > > > > > > >>>>>> Thanks,
> > > > > > > >>>>>>
> > > > > > > >>>>>> Jun
> > > > > > > >>>>>>
> > > > > > > >>>>>> On Tue, Aug 9, 2016 at 4:43 AM, Ben Stopford <b...@confluent.io> wrote:
> > > > > > > >>>>>>
> > > > > > > >>>>>>> Hi Joel
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Thanks for taking the time to look at this. Appreciated.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Regarding throttling on both leader and follower, this proposal covers a more general solution which can guarantee a quota even when a rebalance operation produces an asymmetric profile of load. This means administrators don’t need to calculate the impact that a follower-only quota will have on the leaders they are fetching from - for example, where replica sizes are skewed or where a partial rebalance is required.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Having said that, even with both leader and follower quotas, the use of additional threads is actually optional. There appear to be two general approaches: (1) omit partitions from fetch requests (follower) / fetch responses (leader) when they exceed their quota; (2) delay them, as the existing quota mechanism does, using separate fetchers. Both appear valid, but with slightly different design tradeoffs.
> > > > > > > >>>>>>> The issue with approach (1) is that it departs somewhat from the existing quotas implementation, and must include a notion of fairness within the now size-bounded request and response. The issue with (2) is guaranteeing ordering of updates when replicas shift threads, but this is handled, for the most part, in the code today.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> I’ve updated the rejected alternatives section to make this a little clearer.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> B
> > > > > > > >>>>>>>
> > > > > > > >>>>>>>> On 8 Aug 2016, at 20:38, Joel Koshy <jjkosh...@gmail.com> wrote:
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> Hi Ben,
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> Thanks for the detailed write-up. So the proposal involves self-throttling on the fetcher side and throttling at the leader. Can you elaborate on the reasoning that is given on the wiki: *“The throttle is applied to both leaders and followers. This allows the admin to exert strong guarantees on the throttle limit.”* Is there any reason why one or the other wouldn't be sufficient?
> > > > > > > >>>>>>>> Specifically, if we were to only do self-throttling on the fetchers, we could potentially avoid the additional replica fetchers, right? I.e., the replica fetchers would maintain their quota metrics as you proposed, and each (normal) replica fetch presents an opportunity to make progress for the throttled partitions as long as their effective consumption rate is below the quota limit. If the consumption rate exceeds the quota, then don’t include the throttled partitions in the subsequent fetch requests until the effective consumption rate for those partitions returns to within the quota threshold.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> I have more questions on the proposal, but was more interested in the above to see if it could simplify things a bit.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> Also, can you open up access to the google-doc that you link to?
> > > > > > > >>>>>>>> Thanks,
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> Joel
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> On Mon, Aug 8, 2016 at 5:54 AM, Ben Stopford <b...@confluent.io> wrote:
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>>> We’ve created KIP-73: Replication Quotas
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> The idea is to allow an admin to throttle moving replicas. Full details are here:
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-73+Replication+Quotas
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> Please take a look and let us know your thoughts.
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> Thanks
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> B
> > > > > > > >>>>
> > > > > > > >>>> --
> > > > > > > >>>> -Regards,
> > > > > > > >>>> Mayuresh R. Gharat
> > > > > > > >>>> (862) 250-7125


--
Ben Stopford