Re: [DISCUSS] KIP-925: rack aware task assignment in Kafka Streams

John Roesler Mon, 22 May 2023 15:39:04 -0700

Hi Hao,

Thanks for the KIP!


Overall, I think this is a great idea. I always wanted to circle back after the 
Smooth Scaling KIP to put a proper optimization algorithm into place. I think 
this has the promise to really improve the quality of the balanced assignments 
we produce.

Thanks for providing the details about the MaxCut/MinFlow algorithm. It seems 
like a good choice for me, assuming we choose the right scaling factors for the 
weights we add to the graph. Unfortunately, I don't think that there's a good 
way to see how easy or hard this is going to be until we actually implement it 
and test it.

That leads to the only real piece of feedback I had on the KIP, which is the 
testing portion. You mentioned system/integration/unit tests, but there's not 
too much information about what those tests will do. I'd like to suggest that 
we invest in more simulation testing specifically, similar to what we did in 
https://github.com/apache/kafka/blob/trunk/streams/src/test/java/org/apache/kafka/streams/processor/internals/assignment/TaskAssignorConvergenceTest.java
 .

In fact, it seems like we _could_ write the simulation up front, and then 
implement the algorithm in a dummy way and just see whether it passes the 
simulations or not, before actually integrating it with Kafka Streams.

Basically, I'd be +1 on this KIP today, but I'd feel confident about it if we 
had a little more detail regarding how we are going to verify that the new 
optimizer is actually going to produce more optimal plans than the existing 
assigner we have today.

Thanks again!
-John

On 2023/05/22 16:49:22 Hao Li wrote:
> Hi Colt,
> 
> Thanks for the comments.
> 
> > and I struggle to see how the algorithm isn't at least O(N) where N is
> the number of Tasks...?
> 
> For O(E^2 * (CU)) complexity, C and U can be viewed as constant. Number of
> edges E is T * N where T is the number of clients and N is the number of
> Tasks. This is because a task can be assigned to any client so there will
> be an edge between every task and every client. The total complexity would
> be O(T * N) if we want to be more specific.
> 
> > But if the leaders for each partition are spread across multiple zones,
> how will you handle that?
> 
> This is what the min-cost flow solution is trying to solve? i.e. Find an
> assignment of tasks to clients where across AZ traffic can be minimized.
> But there are some constraints to the solution and one of them is we need
> to balance task assignment first (
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-925%3A+Rack+aware+task+assignment+in+Kafka+Streams#KIP925:RackawaretaskassignmentinKafkaStreams-Designforrackawareassignment).
> So in your example of three tasks' partitions being in the same AZ of a
> client, if there are other clients, we still want to balance the tasks to
> other clients even if putting all tasks to a single client can result in 0
> cross AZ traffic. In
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-925%3A+Rack+aware+task+assignment+in+Kafka+Streams#KIP925:RackawaretaskassignmentinKafkaStreams-Algorithm
> section, the algorithm will try to find a min-cost solution based on
> balanced assignment instead of pure min-cost.
> 
> Thanks,
> Hao
> 
> On Tue, May 9, 2023 at 5:55 PM Colt McNealy <c...@littlehorse.io> wrote:
> 
> > Hello Hao,
> >
> > First of all, THANK YOU for putting this together. I had been hoping
> > someone might bring something like this forward. A few comments:
> >
> > **1: Runtime Complexity
> > > Klein’s cycle canceling algorithm can solve the min-cost flow problem in
> > O(E^2CU) time where C is max cost and U is max capacity. In our particular
> > case, C is 1 and U is at most 3 (A task can have at most 3 topics including
> > changelog topic?). So the algorithm runs in O(E^2) time for our case.
> >
> > A Task can have multiple input topics, and also multiple state stores, and
> > multiple output topics. The most common case is three topics as you
> > described, but this is not necessarily guaranteed. Also, math is one of my
> > weak points, but to me O(E^2) is equivalent to O(1), and I struggle to see
> > how the algorithm isn't at least O(N) where N is the number of Tasks...?
> >
> > **2: Broker-Side Partition Assignments
> > Consider the case with just three topics in a Task (one input, one output,
> > one changelog). If all three partition leaders are in the same Rack (or
> > better yet, the same broker), then we could get massive savings by
> > assigning the Task to that Rack/availability zone. But if the leaders for
> > each partition are spread across multiple zones, how will you handle that?
> > Is that outside the scope of this KIP, or is it worth introducing a
> > kafka-streams-generate-rebalance-proposal.sh tool?
> >
> > Colt McNealy
> > *Founder, LittleHorse.io*
> >
> >
> > On Tue, May 9, 2023 at 4:03 PM Hao Li <h...@confluent.io.invalid> wrote:
> >
> > > Hi all,
> > >
> > > I have submitted KIP-925 to add rack awareness logic in task assignment
> > in
> > > Kafka Streams and would like to start a discussion:
> > >
> > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-925%3A+Rack+aware+task+assignment+in+Kafka+Streams
> > >
> > > --
> > > Thanks,
> > > Hao
> > >
> >
> 
> 
> -- 
> Thanks,
> Hao
>

Re: [DISCUSS] KIP-925: rack aware task assignment in Kafka Streams

Reply via email to