Hey Hao, thanks for the KIP! 1. There's a typo in the "internal.rack.aware.assignment.strategry" config, this should be internal.rack.aware.assignment.strategy.
2. > For O(E^2 * (CU)) complexity, C and U can be viewed as constant. Number of > edges E is T * N where T is the number of clients and N is the number of > Tasks. This is because a task can be assigned to any client so there will > be an edge between every task and every client. The total complexity would > be O(T * N) if we want to be more specific. I feel like I'm missing something here, but if E = T * N and the complexity is ~O(E^2), doesn't this make the total complexity order of O(T^2 * N^2)? 3. > Since 3.C.I and 3.C.II have different tradeoffs and work better in > different workloads etc, we could add an internal configuration to choose one of them at runtime. > Why only an internal configuration? Same goes for internal.rack.aware.assignment.standby.strategry (which also has the typo) 4. > There are no changes in public interfaces. I think it would be good to explicitly call out that users can utilize this new feature by setting the ConsumerConfig's CLIENT_RACK_CONFIG, possibly with a brief example 5. > The idea is that if we always try to make it overlap as much with > HAAssignor’s target assignment, at least there’s a higher chance that tasks won’t be shuffled a > lot if the clients remain the same across rebalances. > This line definitely gave me some pause -- if there was one major takeaway I had after KIP-441, one thing that most limited the feature's success, it was our assumption that clients are relatively stable across rebalances. This was mostly true at limited scale or for on-prem setups, but unsurprisingly broke down in cloud environments or larger clusters. Not only do clients naturally fall in and out of the group, autoscaling is becoming more and more of a thing. Lastly, and this is more easily solved but still worth calling out, an assignment is only deterministic as long as the client.id is persisted. Currently in Streams, we only write the process UUID to the state directory if there is one, ie if at least one persistent stateful task exists in the topology. This made sense in the context of KIP-441, which targeted heavily stateful deployments, but this KIP presumably intends to target more than just the persistent & stateful subset of applications. To make matters even worse, "persistent" is defined in a semantically inconsistent way throughout Streams. All this is to say, it may sound more complicated to remember the previous assignment, but (a) imo it only introduces a lot more complexity and shaky assumptions to continue down this path, and (b) we actually already do persist some amount of state, like the process UUID, and (c) it seems like this is the perfect opportunity to finally rid ourselves of the determinism constraint which has frankly caused more trouble and time lost in sum than it would have taken us to just write the HighAvailabilityTaskAssignor to consider the previous assignment from the start in KIP-441 6. > StickyTaskAssignor users who would like to use rack aware assignment > should upgrade their Kafka Streams version to the version in which HighAvailabilityTaskAssignor > and rack awareness assignment are available. Building off of the above, the HAAssignor hasn't worked out perfectly for everybody up until now, given that we are only adding complexity to it now, on the flipside I would hesitate to try and force everyone to use it if they want to upgrade. We added a "secret" backdoor internal config to allow users to set the task assignor back in KIP-441 for this reason. WDYT about bumping this to a public config on the side in this KIP? On Tue, May 23, 2023 at 11:46 AM Hao Li <h...@confluent.io.invalid> wrote: > Thanks John! Yeah. The ConvergenceTest looks very helpful. I will add it to > the test plan. I will also add tests to verify the new optimizer will > produce a balanced assignment which has no worse cross AZ cost than the > existing assignor. > > Hao > > On Mon, May 22, 2023 at 3:39 PM John Roesler <vvcep...@apache.org> wrote: > > > Hi Hao, > > > > Thanks for the KIP! > > > > Overall, I think this is a great idea. I always wanted to circle back > > after the Smooth Scaling KIP to put a proper optimization algorithm into > > place. I think this has the promise to really improve the quality of the > > balanced assignments we produce. > > > > Thanks for providing the details about the MaxCut/MinFlow algorithm. It > > seems like a good choice for me, assuming we choose the right scaling > > factors for the weights we add to the graph. Unfortunately, I don't think > > that there's a good way to see how easy or hard this is going to be until > > we actually implement it and test it. > > > > That leads to the only real piece of feedback I had on the KIP, which is > > the testing portion. You mentioned system/integration/unit tests, but > > there's not too much information about what those tests will do. I'd like > > to suggest that we invest in more simulation testing specifically, > similar > > to what we did in > > > https://github.com/apache/kafka/blob/trunk/streams/src/test/java/org/apache/kafka/streams/processor/internals/assignment/TaskAssignorConvergenceTest.java > > . > > > > In fact, it seems like we _could_ write the simulation up front, and then > > implement the algorithm in a dummy way and just see whether it passes the > > simulations or not, before actually integrating it with Kafka Streams. > > > > Basically, I'd be +1 on this KIP today, but I'd feel confident about it > if > > we had a little more detail regarding how we are going to verify that the > > new optimizer is actually going to produce more optimal plans than the > > existing assigner we have today. > > > > Thanks again! > > -John > > > > On 2023/05/22 16:49:22 Hao Li wrote: > > > Hi Colt, > > > > > > Thanks for the comments. > > > > > > > and I struggle to see how the algorithm isn't at least O(N) where N > is > > > the number of Tasks...? > > > > > > For O(E^2 * (CU)) complexity, C and U can be viewed as constant. Number > > of > > > edges E is T * N where T is the number of clients and N is the number > of > > > Tasks. This is because a task can be assigned to any client so there > will > > > be an edge between every task and every client. The total complexity > > would > > > be O(T * N) if we want to be more specific. > > > > > > > But if the leaders for each partition are spread across multiple > zones, > > > how will you handle that? > > > > > > This is what the min-cost flow solution is trying to solve? i.e. Find > an > > > assignment of tasks to clients where across AZ traffic can be > minimized. > > > But there are some constraints to the solution and one of them is we > need > > > to balance task assignment first ( > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-925%3A+Rack+aware+task+assignment+in+Kafka+Streams#KIP925:RackawaretaskassignmentinKafkaStreams-Designforrackawareassignment > > ). > > > So in your example of three tasks' partitions being in the same AZ of a > > > client, if there are other clients, we still want to balance the tasks > to > > > other clients even if putting all tasks to a single client can result > in > > 0 > > > cross AZ traffic. In > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-925%3A+Rack+aware+task+assignment+in+Kafka+Streams#KIP925:RackawaretaskassignmentinKafkaStreams-Algorithm > > > section, the algorithm will try to find a min-cost solution based on > > > balanced assignment instead of pure min-cost. > > > > > > Thanks, > > > Hao > > > > > > On Tue, May 9, 2023 at 5:55 PM Colt McNealy <c...@littlehorse.io> > wrote: > > > > > > > Hello Hao, > > > > > > > > First of all, THANK YOU for putting this together. I had been hoping > > > > someone might bring something like this forward. A few comments: > > > > > > > > **1: Runtime Complexity > > > > > Klein’s cycle canceling algorithm can solve the min-cost flow > > problem in > > > > O(E^2CU) time where C is max cost and U is max capacity. In our > > particular > > > > case, C is 1 and U is at most 3 (A task can have at most 3 topics > > including > > > > changelog topic?). So the algorithm runs in O(E^2) time for our case. > > > > > > > > A Task can have multiple input topics, and also multiple state > stores, > > and > > > > multiple output topics. The most common case is three topics as you > > > > described, but this is not necessarily guaranteed. Also, math is one > > of my > > > > weak points, but to me O(E^2) is equivalent to O(1), and I struggle > to > > see > > > > how the algorithm isn't at least O(N) where N is the number of > > Tasks...? > > > > > > > > **2: Broker-Side Partition Assignments > > > > Consider the case with just three topics in a Task (one input, one > > output, > > > > one changelog). If all three partition leaders are in the same Rack > (or > > > > better yet, the same broker), then we could get massive savings by > > > > assigning the Task to that Rack/availability zone. But if the leaders > > for > > > > each partition are spread across multiple zones, how will you handle > > that? > > > > Is that outside the scope of this KIP, or is it worth introducing a > > > > kafka-streams-generate-rebalance-proposal.sh tool? > > > > > > > > Colt McNealy > > > > *Founder, LittleHorse.io* > > > > > > > > > > > > On Tue, May 9, 2023 at 4:03 PM Hao Li <h...@confluent.io.invalid> > > wrote: > > > > > > > > > Hi all, > > > > > > > > > > I have submitted KIP-925 to add rack awareness logic in task > > assignment > > > > in > > > > > Kafka Streams and would like to start a discussion: > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-925%3A+Rack+aware+task+assignment+in+Kafka+Streams > > > > > > > > > > -- > > > > > Thanks, > > > > > Hao > > > > > > > > > > > > > > > > > > -- > > > Thanks, > > > Hao > > > > > > > > -- > Thanks, > Hao >