This all makes a lot of sense, and it mirrors my own thinking now that I've finally taken some time to really walk through the scenarios around why we move partitions around.
What I'm wondering is whether it makes sense to have a conversation around breaking out the controller entirely, separating it from the brokers, and starting to add this intelligence into that. I don't think anyone will disagree that the controller needs a sizable amount of work. This definitely wouldn't be the first project to separate the brains from the dumb worker processes.

-Todd

On Thu, Aug 18, 2016 at 10:53 AM, Gwen Shapira <g...@confluent.io> wrote:

> Just my take, since Jun and Ben originally wanted to solve a more general approach and I talked them out of it :)
>
> When we first add the feature, safety is probably most important in getting people to adopt it - I wanted to make the feature very safe by never throttling something admins don't want to throttle. So we figured the manual approach, while more challenging to configure, is the safest. Admins usually know which replicas are "at risk" of taking over and can choose to throttle them accordingly, they can build their own integration with monitoring tools, etc.
>
> It feels like any "smarts" we try to build into Kafka can be done better with external tools that can watch both Kafka traffic (with the new metrics) and things like network and CPU monitors.
>
> We are open to a smarter approach in Kafka, but perhaps plan it for a follow-up KIP? Maybe even after we have some experience with the manual approach and how best to make throttling decisions. Similar to what we do with choosing partitions to move around - we started manually, admins are getting experience at how they like to choose replicas, and then we can bake their expertise into the product.
>
> Gwen

On Thu, Aug 18, 2016 at 10:29 AM, Jun Rao <j...@confluent.io> wrote:

> Joel,
>
> Yes, for your second comment. The tricky thing is still to figure out which replicas to throttle and by how much, since in general admins probably don't want already in-sync or close to in-sync replicas to be throttled. It would be great to get Todd's opinion on this. Could you ping him?
>
> Yes, we'd be happy to discuss auto-detection of effect traffic more offline.
>
> Thanks,
>
> Jun

On Thu, Aug 18, 2016 at 10:21 AM, Joel Koshy <jjkosh...@gmail.com> wrote:

>> For your first comment. We thought about determining "effect" replicas automatically as well. First, there are some tricky stuff that one has to
>
> Auto-detection of effect traffic: I'm fairly certain it's doable but definitely tricky. I'm also not sure it is something worth tackling at the outset. If we want to spend more time thinking it over, even if it's just an academic exercise, I would be happy to brainstorm offline.
>
>> For your second comment, we discussed that in the client quotas design. A downside of that for client quotas is that a client may be surprised that its traffic is not throttled at one time, but throttled at another with the same quota (basically, less predictability). You can imagine setting a quota for all replication traffic and only slowing down the "effect" replicas if needed. The thought is more or less the same as the above. It requires more
>
> For clients, this is true. I think this is much less of an issue for server-side replication since the "users" here are the Kafka SREs, who generally know these internal details.
>
> I think it would be valuable to get some feedback from SREs on the proposal before proceeding to a vote. (ping Todd)
>
> Joel

On Thu, Aug 18, 2016 at 9:37 AM, Ben Stopford <b...@confluent.io> wrote:

> Hi Joel
>
> Ha! Yes, we had some similar thoughts, on both counts. Both are actually good approaches, but come with some extra complexity.
>
> Segregating the replication type is tempting as it creates a more general solution. One issue is that you need to draw a line between lagging and not lagging. The ISR 'limit' is a tempting divider, but it has the side effect that, once you drop out, you get immediately throttled. Adding a configurable divider is another option, but it is difficult for admins to set, and always a little arbitrary. A better idea is to prioritise, in reverse order of lag. But that also comes with additional complexity of its own.
>
> Under-throttling is also a tempting addition. That's to say, if there's idle bandwidth lying around, not being used, why not use it to let lagging brokers catch up. This involves some comparison to the maximum bandwidth, which could be configurable, or could be derived, with pros and cons for each.
>
> But the more general problem is actually quite hard to reason about, so after some discussion we decided to settle on something simple that we felt we could get working, and extend with these additional features in subsequent KIPs.
>
> I hope that seems reasonable. Jun may wish to add to this.
>
> B

On 18 Aug 2016, at 06:56, Joel Koshy <jjkosh...@gmail.com> wrote:

> On Wed, Aug 17, 2016 at 9:13 PM, Ben Stopford <b...@confluent.io> wrote:
>
>> Let us know if you have any further thoughts on KIP-73, else we'll kick off a vote.
>
> I think the mechanism for throttling replicas looks good. Just had a few more thoughts on the configuration section. What you have looks reasonable, but I was wondering if it could be made simpler. You probably thought through these, so I'm curious to know your take.
>
> My guess is that most of the time, users would want to throttle all effect replication - due to partition reassignments, adding brokers, or a broker coming back online after an extended period of time. In all these scenarios it may be possible to distinguish bootstrap (effect) vs. normal replication based on how far the replica has to catch up. I'm wondering if it is enough to just set an umbrella "effect" replication quota, with perhaps per-topic overrides (say, if some topics are more important than others), as opposed to designating throttled replicas.
>
> Also, IIRC during the client-side quota discussions we had considered the possibility of allowing clients to go above their quotas when resources are available. We ended up not doing that, but for replication throttling it may make sense - i.e., to treat the quota as a soft limit. Another way to look at it: instead of ensuring "effect replication traffic does not flow faster than X bytes/sec", it may be useful to instead ensure that "effect replication traffic only flows as slowly as necessary (so as not to adversely affect normal replication traffic)."
>
> Thanks,
>
> Joel
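The umbrella-quota idea above relies on telling bootstrap ("effect") replication apart from normal replication by how far a replica has to catch up. A minimal sketch of such a classifier is below; the lag threshold and all names are hypothetical, and the threshold is exactly the "configurable divider" Ben describes as hard for admins to set. Nothing here is part of the KIP:

    // Illustrative only: treat a follower's fetch as "effect" (bootstrap) traffic
    // when it trails the leader's log end offset by more than a threshold.
    case class ReplicaLag(fetchOffset: Long, leaderLogEndOffset: Long) {
      def lag: Long = math.max(0L, leaderLogEndOffset - fetchOffset)
    }

    object EffectTrafficClassifier {
      // Hypothetical threshold; choosing this value well is the hard part.
      val bootstrapLagThreshold: Long = 10L * 1000 * 1000

      def isEffectReplication(replica: ReplicaLag): Boolean =
        replica.lag > bootstrapLagThreshold
    }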
On Thu, Aug 11, 2016 at 2:43 PM, Jun Rao <j...@confluent.io> wrote:

> Hi, Joel,
>
> Yes, the response size includes both throttled and unthrottled replicas. However, the response is only delayed up to max.wait if the response size is less than min.bytes, which matches the current behavior. So there is no extra delay due to throttling, right? For replica fetchers, the default min.bytes is 1. So the response is only delayed if there is no byte in the response, which is what we want.
>
> Thanks,
>
> Jun

On Thu, Aug 11, 2016 at 11:53 AM, Joel Koshy <jjkosh...@gmail.com> wrote:

> Hi Jun,
>
> I'm not sure that would work unless we have separate replica fetchers, since this would cause all replicas (including ones that are not throttled) to get delayed. Instead, we could just have the leader populate the throttle-time field of the response as a hint to the follower as to how long it should wait before it adds those replicas back to its subsequent replica fetch requests.
>
> Thanks,
>
> Joel
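Joel's throttle-time hint could be realised on the follower side roughly as in the sketch below. The helper class and the way the hint is surfaced are assumptions for illustration only, not part of the proposal:

    import org.apache.kafka.common.TopicPartition

    // Illustrative follower-side backoff: when the leader reports a throttle time,
    // drop those partitions from fetch requests until that time has elapsed.
    class ThrottledPartitionBackoff {
      private var resumeAtMs = Map.empty[TopicPartition, Long]

      def onThrottleHint(tp: TopicPartition, throttleTimeMs: Int, nowMs: Long): Unit =
        if (throttleTimeMs > 0) resumeAtMs += tp -> (nowMs + throttleTimeMs)

      def mayInclude(tp: TopicPartition, nowMs: Long): Boolean =
        resumeAtMs.get(tp).forall(nowMs >= _)
    }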
On Thu, Aug 11, 2016 at 9:50 AM, Jun Rao <j...@confluent.io> wrote:

> Mayuresh,
>
> That's a good question. I think if the response size (after leader throttling) is smaller than min.bytes, we will just delay the sending of the response up to max.wait as we do now. This should prevent frequent empty responses to the follower.
>
> Thanks,
>
> Jun

On Wed, Aug 10, 2016 at 9:17 PM, Mayuresh Gharat <gharatmayures...@gmail.com> wrote:

> This might have been answered before. I was wondering: when the leader quota is reached and it sends an empty response ("If the inclusion of a partition, listed in the leader's throttled-replicas list, causes the LeaderQuotaRate to be exceeded, that partition is omitted from the response (aka returns 0 bytes)."), at that point the follower quota is NOT reached and the follower is still going to ask for that partition in the next fetch request. Would it be fair to add some logic there so that the follower backs off (for some configurable time) from including those partitions in the next fetch request?
>
> Thanks,
>
> Mayuresh

On Wed, Aug 10, 2016 at 8:06 AM, Ben Stopford <b...@confluent.io> wrote:

> Thanks again for the responses everyone. I've removed the extra fetcher threads from the proposal, switching to the inclusion-based approach. The relevant section is:
>
> The follower makes a request, using the fixed size of replica.fetch.response.max.bytes as per KIP-74 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-74%3A+Add+Fetch+Response+Size+Limit+in+Bytes>. The order of the partitions in the fetch request is randomised to ensure fairness. When the leader receives the fetch request it processes the partitions in the defined order, up to the response's size limit. If the inclusion of a partition, listed in the leader's throttled-replicas list, causes the LeaderQuotaRate to be exceeded, that partition is omitted from the response (aka returns 0 bytes). Logically, this is of the form:
>
>   var bytesAllowedForThrottledPartition = quota.recordAndMaybeAdjust(bytesRequestedForPartition)
>
> When the follower receives the fetch response, if it includes partitions in its throttled-partitions list, it increments the FollowerQuotaRate:
>
>   var includeThrottledPartitionsInNextRequest: Boolean = quota.recordAndEvaluate(previousResponseThrottledBytes)
>
> If the quota is exceeded, no throttled partitions will be included in the next fetch request emitted by this replica fetcher thread.
>
> B
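Expanding the pseudo-code in the quoted KIP text into a slightly fuller sketch of the leader-side behaviour: partitions are processed in the request's (randomised) order, and a throttled partition returns 0 bytes once including it would push the rate over the quota. The quota trait and all names below are assumptions standing in for the broker's quota manager, not the actual implementation:

    import org.apache.kafka.common.TopicPartition

    // Illustrative quota tracker: reports whether a partition is in the leader's
    // throttled-replicas list and whether the LeaderQuotaRate is currently exceeded.
    trait LeaderReplicationQuota {
      def isThrottled(tp: TopicPartition): Boolean
      def isQuotaExceeded: Boolean
      def record(bytes: Long): Unit
    }

    // Build the per-partition byte grants for one fetch response.
    def buildFetchResponse(requested: Seq[(TopicPartition, Long)], // partition -> bytes available
                           quota: LeaderReplicationQuota,
                           responseLimit: Long): Map[TopicPartition, Long] = {
      var remaining = responseLimit
      requested.map { case (tp, available) =>
        val overQuota = quota.isThrottled(tp) && quota.isQuotaExceeded
        val granted = if (overQuota) 0L else math.min(available, remaining)
        if (quota.isThrottled(tp)) quota.record(granted)
        remaining -= granted
        tp -> granted
      }.toMap
    }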
On 9 Aug 2016, at 23:34, Jun Rao <j...@confluent.io> wrote:

> When there are several unthrottled replicas, we could also just do what's suggested in KIP-74. The client is responsible for reordering the partitions and the leader fills in the bytes to those partitions in order, up to the quota limit.
>
> We could also do what you suggested. If the quota is exceeded, include empty data in the response for throttled replicas. Keep doing that until enough time has passed so that the quota is no longer exceeded. This potentially allows better batching per partition. Not sure the two make a big difference in practice though.
>
> Thanks,
>
> Jun

On Tue, Aug 9, 2016 at 2:31 PM, Joel Koshy <jjkosh...@gmail.com> wrote:

>> On the leader side, one challenge is related to the fairness issue that Ben brought up. The question is what if the fetch response limit is filled up by the throttled replicas? If this happens constantly, we will delay the progress of those un-throttled replicas. However, I think we can address this issue by trying to fill up the unthrottled replicas in the response first. So the algorithm would be: fill up unthrottled replicas up to the fetch response limit. If there is space left, fill up throttled replicas. If the quota is exceeded for the throttled replicas, reduce the bytes in the throttled replicas in the response accordingly.
>
> Right - that's what I was trying to convey by truncation (vs. empty). So we would attempt to fill the response for throttled partitions as much as we can before hitting the quota limit. There is one more detail to handle in this: if there are several throttled partitions and not enough remaining allowance in the fetch response to include all the throttled replicas, then we would need to decide which of those partitions get a share; which is why I'm wondering if it is easier to return empty for those partitions entirely in the fetch response - they will make progress in the subsequent fetch. If they don't make fast enough progress, then that would be a case for raising the threshold or letting it complete at an off-peak time.
>
>> With this approach, we need some new logic to handle throttling on the leader, but we can leave the replica threading model unchanged. So, overall, this still seems to be a simpler approach.
>>
>> Thanks,
>>
>> Jun
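Jun's quoted ordering rule (unthrottled replicas first, throttled replicas only into whatever response space and quota allowance remain) might look roughly like the sketch below. As with the other sketches, the names and types are illustrative; the real logic would live inside the fetch handling path:

    import org.apache.kafka.common.TopicPartition

    // Illustrative: satisfy unthrottled replicas first, then grant throttled replicas
    // whatever fits under both the remaining response space and the quota allowance.
    def fillUnthrottledFirst(requested: Seq[(TopicPartition, Long)],
                             throttled: Set[TopicPartition],
                             responseLimit: Long,
                             quotaAllowance: Long): Map[TopicPartition, Long] = {
      var space = responseLimit
      var allowance = quotaAllowance
      val (throttledReqs, unthrottledReqs) = requested.partition { case (tp, _) => throttled(tp) }

      val unthrottledGrants = unthrottledReqs.map { case (tp, bytes) =>
        val granted = math.min(bytes, space)
        space -= granted
        tp -> granted
      }
      val throttledGrants = throttledReqs.map { case (tp, bytes) =>
        val granted = math.min(bytes, math.min(space, allowance))
        space -= granted
        allowance -= granted
        tp -> granted
      }
      (unthrottledGrants ++ throttledGrants).toMap
    }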
On Tue, Aug 9, 2016 at 11:57 AM, Mayuresh Gharat <gharatmayures...@gmail.com> wrote:

> Nice write-up Ben.
>
> I agree with Joel on keeping this simple by excluding the partitions from the fetch request/response when the quota is violated at the follower or leader, instead of having a separate set of threads for handling the quota and non-quota cases. Even though it's different from the current quota implementation, it should be OK since it's internal to the brokers and can be handled by the admins tuning the quota configs appropriately.
>
> Also, can you elaborate with an example how this would be handled: *guaranteeing ordering of updates when replicas shift threads*
>
> Thanks,
>
> Mayuresh

On Tue, Aug 9, 2016 at 10:49 AM, Joel Koshy <jjkosh...@gmail.com> wrote:

> On the need for both leader/follower throttling: that makes sense - thanks for clarifying. For completeness, can we add this detail to the doc - say, after the quote that I pasted earlier?
>
> From an implementation perspective though: I'm still interested in the simplicity of not having to add separate replica fetchers, a delay queue on the leader, and the "moving" of partitions from the throttled replica fetchers to the regular replica fetchers once caught up.
> Instead, I think it would work and be simpler to include or exclude the partitions in the fetch request from the follower and the fetch response from the leader when the quota is violated. The issue of fairness that Ben noted may be a wash between the two options (that Ben wrote in his email). With the default quota delay mechanism, partitions get delayed essentially at random - i.e., whoever fetches at the time of quota violation gets delayed at the leader. So we can adopt a similar policy in choosing to truncate partitions in fetch responses. I.e., if at the time of handling the fetch the "effect" replication rate exceeds the quota, then either empty or truncate those partitions in the response. (BTW, "effect" replication is your terminology in the wiki - i.e., replication due to partition reassignment, adding brokers, etc.)
>
> While this may be slightly different from the existing quota mechanism, I think the difference is small (since we would reuse the quota manager, at worst with some refactoring) and it will be internal to the broker.
>
> So I guess the question is whether this alternative is simple enough and equally functional to not go with dedicated throttled replica fetchers.

On Tue, Aug 9, 2016 at 9:44 AM, Jun Rao <j...@confluent.io> wrote:

> Just to elaborate on what Ben said about why we need throttling on both the leader and the follower side.
> If we only have throttling on the follower side, consider a case where we add 5 new brokers and want to move some replicas from existing brokers over to those 5 brokers. Each of those brokers is going to fetch data from all existing brokers. Then it's possible that the aggregated fetch load from those 5 brokers on a particular existing broker exceeds its outgoing network bandwidth, even though the inbound traffic on each of those 5 brokers is bounded.
>
> If we only have throttling on the leader side, consider the same example above. It's possible for the incoming traffic to each of those 5 brokers to exceed its network bandwidth, since each is fetching data from all existing brokers.
>
> So, being able to set a quota on both the follower and the leader side protects against both cases.
>
> Thanks,
>
> Jun
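To make the asymmetry concrete with made-up numbers: suppose each of the 5 new brokers is allowed 20 MB/s of inbound reassignment traffic and fetches from 10 existing brokers. A follower-only quota still lets a single existing broker be asked for up to 5 x 20 MB/s = 100 MB/s of outbound reassignment traffic if its replicas are the ones being moved. Conversely, a leader-only quota of 20 MB/s per existing broker could still deliver up to 10 x 20 MB/s = 200 MB/s into one new broker. Quotas on both sides cap both aggregates.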
On Tue, Aug 9, 2016 at 4:43 AM, Ben Stopford <b...@confluent.io> wrote:

> Hi Joel
>
> Thanks for taking the time to look at this. Appreciated.
>
> Regarding throttling on both leader and follower, this proposal covers a more general solution which can guarantee a quota, even when a rebalance operation produces an asymmetric profile of load. This means administrators don't need to calculate the impact that a follower-only quota will have on the leaders they are fetching from - for example where replica sizes are skewed or where a partial rebalance is required.
>
> Having said that, even with both leader and follower quotas, the use of additional threads is actually optional. There appear to be two general approaches: (1) omit partitions from fetch requests (follower) / fetch responses (leader) when they exceed their quota, or (2) delay them, as the existing quota mechanism does, using separate fetchers. Both appear valid, but with slightly different design tradeoffs.
>
> The issue with approach (1) is that it departs somewhat from the existing quotas implementation, and must include a notion of fairness within the now size-bounded request and response. The issue with (2) is guaranteeing ordering of updates when replicas shift threads, but this is handled, for the most part, in the code today.
>
> I've updated the rejected alternatives section to make this a little clearer.
>
> B

On 8 Aug 2016, at 20:38, Joel Koshy <jjkosh...@gmail.com> wrote:

> Hi Ben,
>
> Thanks for the detailed write-up. So the proposal involves self-throttling on the fetcher side and throttling at the leader. Can you elaborate on the reasoning given on the wiki: *"The throttle is applied to both leaders and followers. This allows the admin to exert strong guarantees on the throttle limit."* Is there any reason why one or the other wouldn't be sufficient?
>
> Specifically, if we were to only do self-throttling on the fetchers, we could potentially avoid the additional replica fetchers, right? I.e., the replica fetchers would maintain their quota metrics as you proposed, and each (normal) replica fetch presents an opportunity to make progress for the throttled partitions as long as their effective consumption rate is below the quota limit. If the effective rate exceeds the quota, then don't include the throttled partitions in the subsequent fetch requests until the effective consumption rate for those partitions returns to within the quota threshold.
>
> I have more questions on the proposal, but was more interested in the above to see if it could simplify things a bit.
>
> Also, can you open up access to the google-doc that you link to?
>
> Thanks,
>
> Joel
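A rough sketch of the follower-side self-throttling Joel describes here (which is also essentially what the inclusion-based KIP text quoted earlier settles on): record the throttled bytes from the last response, and keep throttled partitions out of the next fetch request while the measured rate is over the follower quota. The quota trait and parameter names are assumptions for illustration:

    import org.apache.kafka.common.TopicPartition

    // Illustrative stand-in for the follower's quota metrics.
    trait FollowerReplicationQuota {
      def record(bytes: Long): Unit
      def isQuotaExceeded: Boolean
    }

    // Decide which partitions the replica fetcher includes in its next fetch request.
    def partitionsForNextFetch(allAssigned: Seq[TopicPartition],
                               throttled: Set[TopicPartition],
                               throttledBytesInLastResponse: Long,
                               quota: FollowerReplicationQuota): Seq[TopicPartition] = {
      quota.record(throttledBytesInLastResponse)
      if (quota.isQuotaExceeded) allAssigned.filterNot(throttled) else allAssigned
    }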
On Mon, Aug 8, 2016 at 5:54 AM, Ben Stopford <b...@confluent.io> wrote:

> We've created KIP-73: Replication Quotas.
>
> The idea is to allow an admin to throttle moving replicas. Full details are here:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-73+Replication+Quotas
>
> Please take a look and let us know your thoughts.
>
> Thanks
>
> B

--
Todd Palino
Staff Site Reliability Engineer
Data Infrastructure Streaming
linkedin.com/in/toddpalino