Hi Bruno, dropping config rack.aware.assignment.enabled and add value NONE to the enum for the possible values of config rack.aware.assignment.strategy sounds good to me.
On Thu, Jun 1, 2023 at 12:39 AM Bruno Cadonna <cado...@apache.org> wrote: > Hi Hao, > > Thanks for the updates! > > What do you think about dropping config rack.aware.assignment.enabled > and add value NONE to the enum for the possible values of config > rack.aware.assignment.strategy? > > Best, > Bruno > > On 31.05.23 23:31, Hao Li wrote: > > Hi all, > > > > I've updated the KIP based on the feedback. Major changes I made: > > 1. Add rack aware assignment to `StickyTaskAssignor` > > 2. Reject `Prefer reliability and then find optimal cost` option in > standby > > task assignment. > > > > > > On Wed, May 31, 2023 at 12:09 PM Hao Li <h...@confluent.io> wrote: > > > >> Hi all, > >> > >> Thanks for the feedback! I will update the KIP accordingly. > >> > >> *For Sophie's comments:* > >> > >> 1 and 2. Good catch. Fixed these. > >> > >> 3 and 4 Yes. We can make this public config and call out the > >> clientConsumer config users need to set. > >> > >> 5. It's ideal to take the previous assignment in HAAssignor into > >> consideration when we compute our target assignment, the complications > come > >> with making sure the assignment can eventually converge and we don't do > >> probing rebalance infinitely. It's not only about storing the previous > >> assignment or get it somehow. We can actually get the previous > assignment > >> now like we do in StickyAssignor. But the previous assignment will > change > >> in each round of probing rebalance. The proposal which added some > weight to > >> make the rack aware assignment lean towards the original HAA's target > >> assignment will add benefits of stability in some corner cases in case > of > >> tie in cross rack traffic cost. But it's not sticky. But the bottom > line is > >> it won't be worse than current HAA's stickiness. > >> > >> 6. I'm fine with changing the assignor config to public. Actually, I > think > >> we can min-cost algorithm with StickyAssignor as well to mitigate the > >> problem of 5. So we can have one public config to choose an assignor and > >> one public config to enable the rack aware assignment. > >> > >> *For Bruno's comments:* > >> > >> The proposal was to implement all the options and use configs to choose > >> them during runtime. We can make those configs public as suggested. > >> 1, 2, 3, 4, 5: agree and will fix those. > >> 6: subscription protocol is not changed. > >> 7: yeah. Let me fix the notations. > >> 8: It meant clients. In the figure, it maps to `c1_1`, `c1_2`, `c1_3` > etc. > >> 9: I'm also ok with just optimizing reliability for standby tasks. Or we > >> could simply run the "balance reliability over cost" greedy algorithm to > >> see if any cost could be reduced. > >> 10: Make sense. Will fix the wording. > >> 11: Make sense. Will update the test part. > >> > >> *For Walker's comments:* > >> 1. Stability for HAA is an issue. See my comments for Sophie's feedback > 5 > >> and 6. I think we could use the rack aware assignment for > StickyAssignor as > >> well. For HAA assignments, it's less sticky and we can only shoot for > >> minimizing the cross rack traffic eventually when everything is stable. > >> 2. Yeah. This is a good point and we can also turn it on for > >> StickyAssignor. > >> > >> Thanks, > >> Hao > >> > >> > >> On Tue, May 30, 2023 at 2:28 PM Sophie Blee-Goldman < > >> ableegold...@gmail.com> wrote: > >> > >>> Hey Hao, thanks for the KIP! > >>> > >>> 1. There's a typo in the "internal.rack.aware.assignment.strategry" > >>> config, > >>> this > >>> should be internal.rack.aware.assignment.strategy. > >>> > >>> 2. > >>> > >>>> For O(E^2 * (CU)) complexity, C and U can be viewed as constant. > >>> Number of > >>>> edges E is T * N where T is the number of clients and N is the number > of > >>>> Tasks. This is because a task can be assigned to any client so there > >>> will > >>>> be an edge between every task and every client. The total complexity > >>> would > >>>> be O(T * N) if we want to be more specific. > >>> > >>> I feel like I'm missing something here, but if E = T * N and the > >>> complexity > >>> is ~O(E^2), doesn't > >>> this make the total complexity order of O(T^2 * N^2)? > >>> > >>> 3. > >>> > >>>> Since 3.C.I and 3.C.II have different tradeoffs and work better in > >>>> different workloads etc, we > >>> > >>> could add an internal configuration to choose one of them at runtime. > >>>> > >>> Why only an internal configuration? Same goes for > >>> internal.rack.aware.assignment.standby.strategry (which also has the > typo) > >>> > >>> 4. > >>> > >>>> There are no changes in public interfaces. > >>> > >>> I think it would be good to explicitly call out that users can utilize > >>> this > >>> new feature by setting the > >>> ConsumerConfig's CLIENT_RACK_CONFIG, possibly with a brief example > >>> > >>> 5. > >>> > >>>> The idea is that if we always try to make it overlap as much with > >>>> HAAssignor’s target > >>> > >>> assignment, at least there’s a higher chance that tasks won’t be > shuffled > >>> a > >>>> lot if the clients > >>> > >>> remain the same across rebalances. > >>>> > >>> This line definitely gave me some pause -- if there was one major > takeaway > >>> I had after KIP-441, > >>> one thing that most limited the feature's success, it was our > assumption > >>> that clients are relatively > >>> stable across rebalances. This was mostly true at limited scale or for > >>> on-prem setups, but > >>> unsurprisingly broke down in cloud environments or larger clusters. Not > >>> only do clients naturally > >>> fall in and out of the group, autoscaling is becoming more and more of > a > >>> thing. > >>> > >>> Lastly, and this is more easily solved but still worth calling out, an > >>> assignment is only deterministic > >>> as long as the client.id is persisted. Currently in Streams, we only > >>> write > >>> the process UUID to the > >>> state directory if there is one, ie if at least one persistent stateful > >>> task exists in the topology. This > >>> made sense in the context of KIP-441, which targeted heavily stateful > >>> deployments, but this KIP > >>> presumably intends to target more than just the persistent & stateful > >>> subset of applications. To > >>> make matters even worse, "persistent" is defined in a semantically > >>> inconsistent way throughout > >>> Streams. > >>> > >>> All this is to say, it may sound more complicated to remember the > previous > >>> assignment, but (a) > >>> imo it only introduces a lot more complexity and shaky assumptions to > >>> continue down this > >>> path, and (b) we actually already do persist some amount of state, like > >>> the > >>> process UUID, and > >>> (c) it seems like this is the perfect opportunity to finally rid > ourselves > >>> of the determinism constraint > >>> which has frankly caused more trouble and time lost in sum than it > would > >>> have taken us to just > >>> write the HighAvailabilityTaskAssignor to consider the previous > assignment > >>> from the start in KIP-441 > >>> > >>> 6. > >>> > >>>> StickyTaskAssignor users who would like to use rack aware assignment > >>>> should upgrade their > >>> > >>> Kafka Streams version to the version in which > HighAvailabilityTaskAssignor > >>>> and rack awareness > >>> > >>> assignment are available. > >>> > >>> Building off of the above, the HAAssignor hasn't worked out perfectly > for > >>> everybody up until now, > >>> given that we are only adding complexity to it now, on the flipside I > >>> would > >>> hesitate to try and force > >>> everyone to use it if they want to upgrade. We added a "secret" > backdoor > >>> internal config to allow > >>> users to set the task assignor back in KIP-441 for this reason. WDYT > about > >>> bumping this to a public > >>> config on the side in this KIP? > >>> > >>> > >>> On Tue, May 23, 2023 at 11:46 AM Hao Li <h...@confluent.io.invalid> > wrote: > >>> > >>>> Thanks John! Yeah. The ConvergenceTest looks very helpful. I will add > >>> it to > >>>> the test plan. I will also add tests to verify the new optimizer will > >>>> produce a balanced assignment which has no worse cross AZ cost than > the > >>>> existing assignor. > >>>> > >>>> Hao > >>>> > >>>> On Mon, May 22, 2023 at 3:39 PM John Roesler <vvcep...@apache.org> > >>> wrote: > >>>> > >>>>> Hi Hao, > >>>>> > >>>>> Thanks for the KIP! > >>>>> > >>>>> Overall, I think this is a great idea. I always wanted to circle back > >>>>> after the Smooth Scaling KIP to put a proper optimization algorithm > >>> into > >>>>> place. I think this has the promise to really improve the quality of > >>> the > >>>>> balanced assignments we produce. > >>>>> > >>>>> Thanks for providing the details about the MaxCut/MinFlow algorithm. > >>> It > >>>>> seems like a good choice for me, assuming we choose the right scaling > >>>>> factors for the weights we add to the graph. Unfortunately, I don't > >>> think > >>>>> that there's a good way to see how easy or hard this is going to be > >>> until > >>>>> we actually implement it and test it. > >>>>> > >>>>> That leads to the only real piece of feedback I had on the KIP, which > >>> is > >>>>> the testing portion. You mentioned system/integration/unit tests, but > >>>>> there's not too much information about what those tests will do. I'd > >>> like > >>>>> to suggest that we invest in more simulation testing specifically, > >>>> similar > >>>>> to what we did in > >>>>> > >>>> > >>> > https://github.com/apache/kafka/blob/trunk/streams/src/test/java/org/apache/kafka/streams/processor/internals/assignment/TaskAssignorConvergenceTest.java > >>>>> . > >>>>> > >>>>> In fact, it seems like we _could_ write the simulation up front, and > >>> then > >>>>> implement the algorithm in a dummy way and just see whether it passes > >>> the > >>>>> simulations or not, before actually integrating it with Kafka > Streams. > >>>>> > >>>>> Basically, I'd be +1 on this KIP today, but I'd feel confident about > >>> it > >>>> if > >>>>> we had a little more detail regarding how we are going to verify that > >>> the > >>>>> new optimizer is actually going to produce more optimal plans than > the > >>>>> existing assigner we have today. > >>>>> > >>>>> Thanks again! > >>>>> -John > >>>>> > >>>>> On 2023/05/22 16:49:22 Hao Li wrote: > >>>>>> Hi Colt, > >>>>>> > >>>>>> Thanks for the comments. > >>>>>> > >>>>>>> and I struggle to see how the algorithm isn't at least O(N) where > >>> N > >>>> is > >>>>>> the number of Tasks...? > >>>>>> > >>>>>> For O(E^2 * (CU)) complexity, C and U can be viewed as constant. > >>> Number > >>>>> of > >>>>>> edges E is T * N where T is the number of clients and N is the > >>> number > >>>> of > >>>>>> Tasks. This is because a task can be assigned to any client so there > >>>> will > >>>>>> be an edge between every task and every client. The total complexity > >>>>> would > >>>>>> be O(T * N) if we want to be more specific. > >>>>>> > >>>>>>> But if the leaders for each partition are spread across multiple > >>>> zones, > >>>>>> how will you handle that? > >>>>>> > >>>>>> This is what the min-cost flow solution is trying to solve? i.e. > >>> Find > >>>> an > >>>>>> assignment of tasks to clients where across AZ traffic can be > >>>> minimized. > >>>>>> But there are some constraints to the solution and one of them is we > >>>> need > >>>>>> to balance task assignment first ( > >>>>>> > >>>>> > >>>> > >>> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-925%3A+Rack+aware+task+assignment+in+Kafka+Streams#KIP925:RackawaretaskassignmentinKafkaStreams-Designforrackawareassignment > >>>>> ). > >>>>>> So in your example of three tasks' partitions being in the same AZ > >>> of a > >>>>>> client, if there are other clients, we still want to balance the > >>> tasks > >>>> to > >>>>>> other clients even if putting all tasks to a single client can > >>> result > >>>> in > >>>>> 0 > >>>>>> cross AZ traffic. In > >>>>>> > >>>>> > >>>> > >>> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-925%3A+Rack+aware+task+assignment+in+Kafka+Streams#KIP925:RackawaretaskassignmentinKafkaStreams-Algorithm > >>>>>> section, the algorithm will try to find a min-cost solution based on > >>>>>> balanced assignment instead of pure min-cost. > >>>>>> > >>>>>> Thanks, > >>>>>> Hao > >>>>>> > >>>>>> On Tue, May 9, 2023 at 5:55 PM Colt McNealy <c...@littlehorse.io> > >>>> wrote: > >>>>>> > >>>>>>> Hello Hao, > >>>>>>> > >>>>>>> First of all, THANK YOU for putting this together. I had been > >>> hoping > >>>>>>> someone might bring something like this forward. A few comments: > >>>>>>> > >>>>>>> **1: Runtime Complexity > >>>>>>>> Klein’s cycle canceling algorithm can solve the min-cost flow > >>>>> problem in > >>>>>>> O(E^2CU) time where C is max cost and U is max capacity. In our > >>>>> particular > >>>>>>> case, C is 1 and U is at most 3 (A task can have at most 3 topics > >>>>> including > >>>>>>> changelog topic?). So the algorithm runs in O(E^2) time for our > >>> case. > >>>>>>> > >>>>>>> A Task can have multiple input topics, and also multiple state > >>>> stores, > >>>>> and > >>>>>>> multiple output topics. The most common case is three topics as > >>> you > >>>>>>> described, but this is not necessarily guaranteed. Also, math is > >>> one > >>>>> of my > >>>>>>> weak points, but to me O(E^2) is equivalent to O(1), and I > >>> struggle > >>>> to > >>>>> see > >>>>>>> how the algorithm isn't at least O(N) where N is the number of > >>>>> Tasks...? > >>>>>>> > >>>>>>> **2: Broker-Side Partition Assignments > >>>>>>> Consider the case with just three topics in a Task (one input, one > >>>>> output, > >>>>>>> one changelog). If all three partition leaders are in the same > >>> Rack > >>>> (or > >>>>>>> better yet, the same broker), then we could get massive savings by > >>>>>>> assigning the Task to that Rack/availability zone. But if the > >>> leaders > >>>>> for > >>>>>>> each partition are spread across multiple zones, how will you > >>> handle > >>>>> that? > >>>>>>> Is that outside the scope of this KIP, or is it worth introducing > >>> a > >>>>>>> kafka-streams-generate-rebalance-proposal.sh tool? > >>>>>>> > >>>>>>> Colt McNealy > >>>>>>> *Founder, LittleHorse.io* > >>>>>>> > >>>>>>> > >>>>>>> On Tue, May 9, 2023 at 4:03 PM Hao Li <h...@confluent.io.invalid> > >>>>> wrote: > >>>>>>> > >>>>>>>> Hi all, > >>>>>>>> > >>>>>>>> I have submitted KIP-925 to add rack awareness logic in task > >>>>> assignment > >>>>>>> in > >>>>>>>> Kafka Streams and would like to start a discussion: > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>> > >>>>> > >>>> > >>> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-925%3A+Rack+aware+task+assignment+in+Kafka+Streams > >>>>>>>> > >>>>>>>> -- > >>>>>>>> Thanks, > >>>>>>>> Hao > >>>>>>>> > >>>>>>> > >>>>>> > >>>>>> > >>>>>> -- > >>>>>> Thanks, > >>>>>> Hao > >>>>>> > >>>>> > >>>> > >>>> > >>>> -- > >>>> Thanks, > >>>> Hao > >>>> > >>> > >> > >> > >> -- > >> Thanks, > >> Hao > >> > > > > > -- Thanks, Hao