Yes, I think it's a great discussion to have. There are definitely pros and cons to both approaches, and it's worth thinking about the right way forward.
On Thu, Aug 18, 2016 at 11:03 AM, Todd Palino <tpal...@gmail.com> wrote:

This all makes a lot of sense, and mirrors what I'm thinking as I finally took some time to really walk through scenarios around why we move partitions around.

What I'm wondering is whether it makes sense to have a conversation around breaking out the controller entirely, separating it from the brokers, and starting to add this intelligence into that. I don't think anyone will disagree that the controller needs a sizable amount of work. This definitely wouldn't be the first project to separate out the brains from the dumb worker processes.

-Todd

On Thu, Aug 18, 2016 at 10:53 AM, Gwen Shapira <g...@confluent.io> wrote:

Just my take, since Jun and Ben originally wanted to solve a more general approach and I talked them out of it :)

When we first add the feature, safety is probably most important in getting people to adopt it - I wanted to make the feature very safe by never throttling something admins don't want to throttle. So we figured the manual approach, while more challenging to configure, is the safest. Admins usually know which replicas are "at risk" of taking over and can choose to throttle them accordingly, they can build their own integration with monitoring tools, etc.

It feels like any "smarts" we try to build into Kafka can be done better by external tools that can watch both Kafka traffic (with the new metrics) and things like network and CPU monitors.

We are open to a smarter approach in Kafka, but perhaps plan it for a follow-up KIP? Maybe even after we have some experience with the manual approach and how best to make throttling decisions. Similar to what we do with choosing partitions to move around - we started manually, admins are getting experience at how they like to choose replicas, and then we can bake their expertise into the product.
Gwen

On Thu, Aug 18, 2016 at 10:29 AM, Jun Rao <j...@confluent.io> wrote:

Joel,

Yes, for your second comment. The tricky thing is still to figure out which replicas to throttle and by how much, since in general, admins probably don't want already in-sync or close to in-sync replicas to be throttled. It would be great to get Todd's opinion on this. Could you ping him?

Yes, we'd be happy to discuss auto-detection of effect traffic more offline.

Thanks,

Jun

On Thu, Aug 18, 2016 at 10:21 AM, Joel Koshy <jjkosh...@gmail.com> wrote:

> For your first comment. We thought about determining "effect" replicas
> automatically as well. First, there are some tricky stuff that one has to

Auto-detection of effect traffic: I'm fairly certain it's doable but definitely tricky. I'm also not sure it is something worth tackling at the outset. If we want to spend more time thinking it over, even if it's just an academic exercise, I would be happy to brainstorm offline.

> For your second comment, we discussed that in the client quotas design. A
> downside of that for client quotas is that a client may be surprised that
> its traffic is not throttled at one time, but throttled at another with the
> same quota (basically, less predictability). You can imagine setting a
> quota for all replication traffic and only slowing down the "effect"
> replicas if needed. The thought is more or less the same as the above.
> It requires more

For clients, this is true. I think this is much less of an issue for server-side replication since the "users" here are the Kafka SREs, who generally know these internal details.

I think it would be valuable to get some feedback from SREs on the proposal before proceeding to a vote.
(ping Todd)

Joel

On Thu, Aug 18, 2016 at 9:37 AM, Ben Stopford <b...@confluent.io> wrote:

Hi Joel

Ha! Yes, we had some similar thoughts, on both counts. Both are actually good approaches, but come with some extra complexity.

Segregating the replication type is tempting as it creates a more general solution. One issue is that you need to draw a line between lagging and not lagging. The ISR 'limit' is a tempting divider, but has the side effect that, once you drop out, you get immediately throttled. Adding a configurable divider is another option, but difficult for admins to set, and always a little arbitrary. A better idea is to prioritise in reverse order of lag. But that also comes with additional complexity of its own.

Under-throttling is also a tempting addition. That's to say, if there's idle bandwidth lying around, not being used, why not use it to let lagging brokers catch up? This involves some comparison to the maximum bandwidth, which could be configurable or could be derived, with pros and cons for each.

But the more general problem is actually quite hard to reason about, so after some discussion we decided to settle on something simple that we felt we could get working, and extend to add these additional features as subsequent KIPs.

I hope that seems reasonable. Jun may wish to add to this.
B

On 18 Aug 2016, at 06:56, Joel Koshy <jjkosh...@gmail.com> wrote:

On Wed, Aug 17, 2016 at 9:13 PM, Ben Stopford <b...@confluent.io> wrote:

> Let us know if you have any further thoughts on KIP-73, else we'll kick
> off a vote.

I think the mechanism for throttling replicas looks good. I just had a few more thoughts on the configuration section. What you have looks reasonable, but I was wondering if it could be made simpler. You probably thought through these, so I'm curious to know your take.

My guess is that most of the time, users would want to throttle all effect replication - due to partition reassignments, adding brokers, or a broker coming back online after an extended period of time. In all these scenarios it may be possible to distinguish bootstrap (effect) vs. normal replication - based on how far the replica has to catch up. I'm wondering if it is enough to just set an umbrella "effect" replication quota, with perhaps per-topic overrides (say, if some topics are more important than others), as opposed to designating throttled replicas.

Also, IIRC during the client-side quota discussions we had considered the possibility of allowing clients to go above their quotas when resources are available. We ended up not doing that, but for replication throttling it may make sense - i.e., to treat the quota as a soft limit.
Another way to look at it: instead of ensuring "effect replication traffic does not flow faster than X bytes/sec", it may be useful to instead ensure that "effect replication traffic only flows as slowly as necessary (so as not to adversely affect normal replication traffic)."

Thanks,

Joel

On Thu, Aug 11, 2016 at 2:43 PM, Jun Rao <j...@confluent.io> wrote:

Hi, Joel,

Yes, the response size includes both throttled and unthrottled replicas. However, the response is only delayed up to max.wait if the response size is less than min.bytes, which matches the current behavior. So, there is no extra delay due to throttling, right? For replica fetchers, the default min.bytes is 1. So, the response is only delayed if there is no byte in the response, which is what we want.

Thanks,

Jun

On Thu, Aug 11, 2016 at 11:53 AM, Joel Koshy <jjkosh...@gmail.com> wrote:

Hi Jun,

I'm not sure that would work unless we have separate replica fetchers, since this would cause all replicas (including ones that are not throttled) to get delayed.
Instead, we could just have the leader populate the throttle-time field of the response as a hint to the follower as to how long it should wait before it adds those replicas back to its subsequent replica fetch requests.

Thanks,

Joel

On Thu, Aug 11, 2016 at 9:50 AM, Jun Rao <j...@confluent.io> wrote:

Mayuresh,

That's a good question. I think if the response size (after leader throttling) is smaller than min.bytes, we will just delay the sending of the response up to max.wait as we do now. This should prevent frequent empty responses to the follower.

Thanks,

Jun

On Wed, Aug 10, 2016 at 9:17 PM, Mayuresh Gharat <gharatmayures...@gmail.com> wrote:

This might have been answered before. I was wondering about the case when the leader quota is reached and it sends an empty response ("If the inclusion of a partition, listed in the leader's throttled-replicas list, causes the LeaderQuotaRate to be exceeded, that partition is omitted from the response (aka returns 0 bytes)."). At this point the follower quota is NOT reached, and the follower is still going to ask for that partition in the next fetch request.
Would it be fair to add some logic there so that the follower backs off (for some configurable time) from including those partitions in the next fetch request?

Thanks,

Mayuresh

On Wed, Aug 10, 2016 at 8:06 AM, Ben Stopford <b...@confluent.io> wrote:

Thanks again for the responses everyone. I've removed the extra fetcher threads from the proposal, switching to the inclusion-based approach. The relevant section is:

The follower makes a request, using the fixed size of replica.fetch.response.max.bytes as per KIP-74 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-74%3A+Add+Fetch+Response+Size+Limit+in+Bytes>. The order of the partitions in the fetch request is randomised to ensure fairness. When the leader receives the fetch request, it processes the partitions in the defined order, up to the response's size limit. If the inclusion of a partition, listed in the leader's throttled-replicas list, causes the LeaderQuotaRate to be exceeded, that partition is omitted from the response (aka returns 0 bytes).
Logically, this is of the form:

    var bytesAllowedForThrottledPartition = quota.recordAndMaybeAdjust(bytesRequestedForPartition)

When the follower receives the fetch response, if it includes partitions in its throttled-partitions list, it increments the FollowerQuotaRate:

    var includeThrottledPartitionsInNextRequest: Boolean = quota.recordAndEvaluate(previousResponseThrottledBytes)

If the quota is exceeded, no throttled partitions will be included in the next fetch request emitted by this replica fetcher thread.

B

On 9 Aug 2016, at 23:34, Jun Rao <j...@confluent.io> wrote:

When there are several unthrottled replicas, we could also just do what's suggested in KIP-74. The client is responsible for reordering the partitions, and the leader fills in the bytes to those partitions in order, up to the quota limit.

We could also do what you suggested. If the quota is exceeded, include empty data in the response for throttled replicas. Keep doing that until enough time has passed so that the quota is no longer exceeded. This potentially allows better batching per partition.
Not sure if the two make a big difference in practice though.

Thanks,

Jun

On Tue, Aug 9, 2016 at 2:31 PM, Joel Koshy <jjkosh...@gmail.com> wrote:

> On the leader side, one challenge is related to the fairness issue that
> Ben brought up. The question is what if the fetch response limit is filled
> up by the throttled replicas? If this happens constantly, we will delay
> the progress of those un-throttled replicas. However, I think we can
> address this issue by trying to fill up the unthrottled replicas in the
> response first. So, the algorithm would be: fill up unthrottled replicas
> up to the fetch response limit. If there is space left, fill up throttled
> replicas. If the quota is exceeded for the throttled replicas, reduce the
> bytes in the throttled replicas in the response accordingly.

Right - that's what I was trying to convey by truncation (vs. empty).
So we would attempt to fill the response for throttled partitions as much as we can before hitting the quota limit. There is one more detail to handle in this: if there are several throttled partitions and not enough remaining allowance in the fetch response to include all the throttled replicas, then we would need to decide which of those partitions get a share; which is why I'm wondering if it is easier to return empty for those partitions entirely in the fetch response - they will make progress in the subsequent fetch. If they don't make fast enough progress, then that would be a case for raising the threshold or letting it complete at an off-peak time.

> With this approach, we need some new logic to handle throttling on the
> leader, but we can leave the replica threading model unchanged. So,
> overall, this still seems to be a simpler approach.
> Thanks,
>
> Jun

On Tue, Aug 9, 2016 at 11:57 AM, Mayuresh Gharat <gharatmayures...@gmail.com> wrote:

Nice write-up Ben.

I agree with Joel on keeping this simple by excluding the partitions from the fetch request/response when the quota is violated at the follower or leader, instead of having a separate set of threads for handling the quota and non-quota cases. Even though it's different from the current quota implementation, it should be OK since it's internal to brokers and can be handled by tuning the quota configs for it appropriately by the admins.
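The follower-side half of this include/exclude mechanism - record the throttled bytes just received, and drop throttled partitions from the next fetch while the observed rate exceeds the quota - could be sketched roughly as below. This is an illustrative Python sketch only, not Kafka's implementation (the broker is Scala and uses sampled rate metrics); the class and method names are invented here, loosely echoing the `recordAndEvaluate` pseudocode quoted elsewhere in this thread.

```python
import time
from collections import deque


class ReplicationQuota:
    """Illustrative sliding-window byte-rate quota (invented for this
    sketch; not Kafka's actual quota manager)."""

    def __init__(self, bytes_per_sec, window_sec=1.0):
        self.bytes_per_sec = bytes_per_sec
        self.window_sec = window_sec
        self.samples = deque()  # (timestamp, bytes) pairs inside the window

    def record_and_evaluate(self, num_bytes, now=None):
        """Record bytes just received for throttled replicas; return True
        if throttled partitions may be included in the next fetch request."""
        now = time.monotonic() if now is None else now
        self.samples.append((now, num_bytes))
        # Expire samples that have fallen out of the window.
        while self.samples and self.samples[0][0] <= now - self.window_sec:
            self.samples.popleft()
        rate = sum(b for _, b in self.samples) / self.window_sec
        return rate <= self.bytes_per_sec
```

A fetcher loop would call `record_and_evaluate` with the throttled bytes from each response and simply omit throttled partitions from the next request whenever it returns False.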
Also, can you elaborate with an example on how this would be handled: *guaranteeing ordering of updates when replicas shift threads*

Thanks,

Mayuresh

On Tue, Aug 9, 2016 at 10:49 AM, Joel Koshy <jjkosh...@gmail.com> wrote:

On the need for both leader/follower throttling: that makes sense - thanks for clarifying. For completeness, can we add this detail to the doc - say, after the quote that I pasted earlier?

From an implementation perspective though: I'm still interested in the simplicity of not having to add separate replica fetchers, a delay queue on the leader, and "moving" partitions from the throttled replica fetchers to the regular replica fetchers once caught up.

Instead, I think it would work and be simpler to include or exclude the partitions in the fetch request from the follower and fetch response from the leader when the quota is violated.
The issue of fairness that Ben noted may be a wash between the two options (that Ben wrote in his email). With the default quota delay mechanism, partitions get delayed essentially at random - i.e., whoever fetches at the time of quota violation gets delayed at the leader. So we can adopt a similar policy in choosing to truncate partitions in fetch responses. I.e., if at the time of handling the fetch the "effect" replication rate exceeds the quota, then either empty or truncate those partitions from the response. (BTW, effect replication is your terminology in the wiki - i.e., replication due to partition reassignment, adding brokers, etc.)

While this may be slightly different from the existing quota mechanism, I think the difference is small (since we would reuse the quota manager, at worst with some refactoring) and will be internal to the broker.
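The leader-side fill policy being discussed - fill unthrottled partitions first up to the response size limit, then give throttled partitions whatever room and quota allowance remain, returning them empty (0 bytes) otherwise - might look roughly like this. This is an illustrative Python sketch with invented names; the real broker implements this in Scala against actual log segments.

```python
def fill_fetch_response(partitions, available_bytes, response_limit,
                        throttled, quota_allowance):
    """Sketch of the fill policy from the thread. `partitions` preserves
    the (randomised) order from the fetch request; `available_bytes` maps
    each partition to the bytes it has ready; `throttled` is the leader's
    throttled-replicas set; `quota_allowance` is the remaining byte budget
    for throttled traffic in this response."""
    response = {}
    remaining = response_limit
    # First pass: unthrottled partitions, in request order, up to the
    # response size limit.
    for p in partitions:
        if p in throttled:
            continue
        take = min(available_bytes[p], remaining)
        response[p] = take
        remaining -= take
    # Second pass: throttled partitions share the leftover space, bounded
    # additionally by the quota allowance; exhausted ones return 0 bytes.
    for p in partitions:
        if p not in throttled:
            continue
        take = min(available_bytes[p], remaining, quota_allowance)
        response[p] = take
        remaining -= take
        quota_allowance -= take
    return response
```

With no room or allowance left, a throttled partition simply comes back with 0 bytes, matching the "omit from the response" behaviour described above.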
So I guess the question is whether this alternative is simple enough, and equally functional, to not go with dedicated throttled replica fetchers.

On Tue, Aug 9, 2016 at 9:44 AM, Jun Rao <j...@confluent.io> wrote:

Just to elaborate on what Ben said about why we need throttling on both the leader and the follower side.

If we only have throttling on the follower side, consider a case where we add 5 more new brokers and want to move some replicas from existing brokers over to those 5 brokers. Each of those brokers is going to fetch data from all existing brokers. Then, it's possible that the aggregated fetch load from those 5 brokers on a particular existing broker exceeds its outgoing network bandwidth, even though the inbound traffic on each of those 5 brokers is bounded.
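To make the arithmetic of this example concrete (with assumed numbers, since the thread names none): with follower-only quotas, each new broker's intake is capped, but one existing leader can still end up serving pieces of all five streams at once.

```python
# Assumed figures for illustration only: 5 new brokers, each
# follower-throttled to 50 MB/s inbound.
new_brokers = 5
follower_quota_mb_s = 50

# Worst case for a single existing broker: it happens to lead every
# partition the 5 new brokers are currently fetching, so its egress
# approaches the sum of all the follower quotas.
worst_case_leader_egress = new_brokers * follower_quota_mb_s
print(worst_case_leader_egress)  # 250 MB/s from one existing broker
```

A leader-side quota bounds that egress directly, which is why the proposal applies the throttle on both sides.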
If we only have throttling on the leader side, consider the same example above. It's possible for the incoming traffic to each of those 5 brokers to exceed its network bandwidth, since it is fetching data from all existing brokers.

So, being able to set a quota on both the follower and the leader side protects both cases.

Thanks,

Jun

On Tue, Aug 9, 2016 at 4:43 AM, Ben Stopford <b...@confluent.io> wrote:

Hi Joel

Thanks for taking the time to look at this. Appreciated.

Regarding throttling on both leader and follower: this proposal covers a more general solution, which can guarantee a quota even when a rebalance operation produces an asymmetric profile of load.
This means administrators don't need to calculate the impact that a follower-only quota will have on the leaders they are fetching from - for example, where replica sizes are skewed or where a partial rebalance is required.

Having said that, even with both leader and follower quotas, the use of additional threads is actually optional. There appear to be two general approaches: (1) omit partitions from fetch requests (follower) / fetch responses (leader) when they exceed their quota; (2) delay them, as the existing quota mechanism does, using separate fetchers. Both appear valid, but with slightly different design tradeoffs.

The issue with approach (1) is that it departs somewhat from the existing quotas implementation, and must include a notion of fairness within the now size-bounded request and response.
The issue with (2) is guaranteeing ordering of updates when replicas shift threads, but this is handled, for the most part, in the code today.

I've updated the rejected alternatives section to make this a little clearer.

B

On 8 Aug 2016, at 20:38, Joel Koshy <jjkosh...@gmail.com> wrote:

Hi Ben,

Thanks for the detailed write-up. So the proposal involves self-throttling on the fetcher side and throttling at the leader. Can you elaborate on the reasoning that is given on the wiki: *"The throttle is applied to both leaders and followers. This allows the admin to exert strong guarantees on the throttle limit."* Is there any reason why one or the other wouldn't be sufficient?
Specifically, if we were to only do self-throttling on the fetchers, we could potentially avoid the additional replica fetchers, right? I.e., the replica fetchers would maintain their quota metrics as you proposed, and each (normal) replica fetch presents an opportunity to make progress for the throttled partitions, as long as their effective consumption rate is below the quota limit. If it exceeds the consumption rate, then don't include the throttled partitions in the subsequent fetch requests until the effective consumption rate for those partitions returns to within the quota threshold.

I have more questions on the proposal, but was more interested in the above to see if it could simplify things a bit.

Also, can you open up access to the google-doc that you link to?
Thanks,

Joel

On Mon, Aug 8, 2016 at 5:54 AM, Ben Stopford <b...@confluent.io> wrote:

We've created KIP-73: Replication Quotas

The idea is to allow an admin to throttle moving replicas. Full details are here:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-73+Replication+Quotas

Please take a look and let us know your thoughts.

Thanks

B

--
-Regards,
Mayuresh R. Gharat
(862) 250-7125

--
Ben Stopford

--
Gwen Shapira
Product Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter | blog

--
*Todd Palino*
Staff Site Reliability Engineer
Data Infrastructure Streaming

linkedin.com/in/toddpalino