Thanks John! Yeah. The ConvergenceTest looks very helpful. I will add it to the test plan. I will also add tests to verify the new optimizer will produce a balanced assignment which has no worse cross AZ cost than the existing assignor.
Hao On Mon, May 22, 2023 at 3:39 PM John Roesler <vvcep...@apache.org> wrote: > Hi Hao, > > Thanks for the KIP! > > Overall, I think this is a great idea. I always wanted to circle back > after the Smooth Scaling KIP to put a proper optimization algorithm into > place. I think this has the promise to really improve the quality of the > balanced assignments we produce. > > Thanks for providing the details about the MaxCut/MinFlow algorithm. It > seems like a good choice for me, assuming we choose the right scaling > factors for the weights we add to the graph. Unfortunately, I don't think > that there's a good way to see how easy or hard this is going to be until > we actually implement it and test it. > > That leads to the only real piece of feedback I had on the KIP, which is > the testing portion. You mentioned system/integration/unit tests, but > there's not too much information about what those tests will do. I'd like > to suggest that we invest in more simulation testing specifically, similar > to what we did in > https://github.com/apache/kafka/blob/trunk/streams/src/test/java/org/apache/kafka/streams/processor/internals/assignment/TaskAssignorConvergenceTest.java > . > > In fact, it seems like we _could_ write the simulation up front, and then > implement the algorithm in a dummy way and just see whether it passes the > simulations or not, before actually integrating it with Kafka Streams. > > Basically, I'd be +1 on this KIP today, but I'd feel confident about it if > we had a little more detail regarding how we are going to verify that the > new optimizer is actually going to produce more optimal plans than the > existing assigner we have today. > > Thanks again! > -John > > On 2023/05/22 16:49:22 Hao Li wrote: > > Hi Colt, > > > > Thanks for the comments. > > > > > and I struggle to see how the algorithm isn't at least O(N) where N is > > the number of Tasks...? > > > > For O(E^2 * (CU)) complexity, C and U can be viewed as constant. Number > of > > edges E is T * N where T is the number of clients and N is the number of > > Tasks. This is because a task can be assigned to any client so there will > > be an edge between every task and every client. The total complexity > would > > be O(T * N) if we want to be more specific. > > > > > But if the leaders for each partition are spread across multiple zones, > > how will you handle that? > > > > This is what the min-cost flow solution is trying to solve? i.e. Find an > > assignment of tasks to clients where across AZ traffic can be minimized. > > But there are some constraints to the solution and one of them is we need > > to balance task assignment first ( > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-925%3A+Rack+aware+task+assignment+in+Kafka+Streams#KIP925:RackawaretaskassignmentinKafkaStreams-Designforrackawareassignment > ). > > So in your example of three tasks' partitions being in the same AZ of a > > client, if there are other clients, we still want to balance the tasks to > > other clients even if putting all tasks to a single client can result in > 0 > > cross AZ traffic. In > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-925%3A+Rack+aware+task+assignment+in+Kafka+Streams#KIP925:RackawaretaskassignmentinKafkaStreams-Algorithm > > section, the algorithm will try to find a min-cost solution based on > > balanced assignment instead of pure min-cost. > > > > Thanks, > > Hao > > > > On Tue, May 9, 2023 at 5:55 PM Colt McNealy <c...@littlehorse.io> wrote: > > > > > Hello Hao, > > > > > > First of all, THANK YOU for putting this together. I had been hoping > > > someone might bring something like this forward. A few comments: > > > > > > **1: Runtime Complexity > > > > Klein’s cycle canceling algorithm can solve the min-cost flow > problem in > > > O(E^2CU) time where C is max cost and U is max capacity. In our > particular > > > case, C is 1 and U is at most 3 (A task can have at most 3 topics > including > > > changelog topic?). So the algorithm runs in O(E^2) time for our case. > > > > > > A Task can have multiple input topics, and also multiple state stores, > and > > > multiple output topics. The most common case is three topics as you > > > described, but this is not necessarily guaranteed. Also, math is one > of my > > > weak points, but to me O(E^2) is equivalent to O(1), and I struggle to > see > > > how the algorithm isn't at least O(N) where N is the number of > Tasks...? > > > > > > **2: Broker-Side Partition Assignments > > > Consider the case with just three topics in a Task (one input, one > output, > > > one changelog). If all three partition leaders are in the same Rack (or > > > better yet, the same broker), then we could get massive savings by > > > assigning the Task to that Rack/availability zone. But if the leaders > for > > > each partition are spread across multiple zones, how will you handle > that? > > > Is that outside the scope of this KIP, or is it worth introducing a > > > kafka-streams-generate-rebalance-proposal.sh tool? > > > > > > Colt McNealy > > > *Founder, LittleHorse.io* > > > > > > > > > On Tue, May 9, 2023 at 4:03 PM Hao Li <h...@confluent.io.invalid> > wrote: > > > > > > > Hi all, > > > > > > > > I have submitted KIP-925 to add rack awareness logic in task > assignment > > > in > > > > Kafka Streams and would like to start a discussion: > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-925%3A+Rack+aware+task+assignment+in+Kafka+Streams > > > > > > > > -- > > > > Thanks, > > > > Hao > > > > > > > > > > > > > -- > > Thanks, > > Hao > > > -- Thanks, Hao