Hi all,

I've updated the KIP based on the feedback. Major changes I made:
1. Added rack-aware assignment to `StickyTaskAssignor`.
2. Rejected the "Prefer reliability and then find optimal cost" option in
standby task assignment.
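To make the opt-in path concrete for reviewers, here is a minimal sketch of what enabling rack awareness might look like from the application side. The consumer's `client.rack` config (and Streams forwarding `consumer.`-prefixed configs to its internal consumers) exists today; the `rack.aware.assignment.strategy` key and its value are placeholders for the public config discussed in this thread, so the final names may differ:

```java
import java.util.Properties;

public class RackAwareConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("application.id", "rack-aware-app");
        props.put("bootstrap.servers", "localhost:9092");
        // Kafka Streams forwards "consumer."-prefixed configs to its internal
        // consumers, so this sets the consumer's client.rack. It should match
        // the brokers' broker.rack (e.g. the availability zone) so the
        // assignor can compute cross-rack traffic cost.
        props.put("consumer.client.rack", "us-east-1a");
        // Placeholder key/value for the proposed public config enabling
        // rack-aware assignment; the final name is still under discussion.
        props.put("rack.aware.assignment.strategy", "min_traffic");
        System.out.println(props.getProperty("consumer.client.rack"));
    }
}
```
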
On Wed, May 31, 2023 at 12:09 PM Hao Li <h...@confluent.io> wrote:

> Hi all,
>
> Thanks for the feedback! I will update the KIP accordingly.
>
> *For Sophie's comments:*
>
> 1 and 2. Good catch. Fixed these.
>
> 3 and 4. Yes. We can make this a public config and call out the
> client.rack consumer config users need to set.
>
> 5. It's ideal to take the previous assignment in HAAssignor into
> consideration when we compute our target assignment. The complication is
> making sure the assignment can eventually converge and we don't do
> probing rebalances infinitely. It's not only about storing the previous
> assignment or getting it somehow; we can actually get the previous
> assignment now, like we do in StickyAssignor. But the previous assignment
> will change in each round of probing rebalance. The proposal, which adds
> some weight to make the rack-aware assignment lean towards the original
> HAA's target assignment, will add stability in some corner cases where
> there's a tie in cross-rack traffic cost. It's not sticky, but the bottom
> line is it won't be worse than the current HAA's stickiness.
>
> 6. I'm fine with changing the assignor config to public. Actually, I
> think we can use the min-cost algorithm with StickyAssignor as well to
> mitigate the problem in 5. So we can have one public config to choose an
> assignor and one public config to enable the rack-aware assignment.
>
> *For Bruno's comments:*
>
> The proposal was to implement all the options and use configs to choose
> them during runtime. We can make those configs public as suggested.
> 1, 2, 3, 4, 5: agree and will fix those.
> 6: the subscription protocol is not changed.
> 7: yeah. Let me fix the notations.
> 8: It meant clients. In the figure, it maps to `c1_1`, `c1_2`, `c1_3` etc.
> 9: I'm also ok with just optimizing reliability for standby tasks. Or we
> could simply run the "balance reliability over cost" greedy algorithm to
> see if any cost could be reduced.
> 10: Makes sense. Will fix the wording.
> 11: Makes sense. Will update the test part.
>
> *For Walker's comments:*
> 1. Stability for HAA is an issue. See my comments on Sophie's feedback 5
> and 6. I think we could use the rack-aware assignment for StickyAssignor
> as well. For HAA assignments, it's less sticky, and we can only shoot for
> minimizing the cross-rack traffic eventually, when everything is stable.
> 2. Yeah. This is a good point and we can also turn it on for
> StickyAssignor.
>
> Thanks,
> Hao
>
>
> On Tue, May 30, 2023 at 2:28 PM Sophie Blee-Goldman <
> ableegold...@gmail.com> wrote:
>
>> Hey Hao, thanks for the KIP!
>>
>> 1. There's a typo in the "internal.rack.aware.assignment.strategry"
>> config; this should be internal.rack.aware.assignment.strategy.
>>
>> 2.
>>
>> > For O(E^2 * (CU)) complexity, C and U can be viewed as constant.
>> > Number of edges E is T * N where T is the number of clients and N is
>> > the number of Tasks. This is because a task can be assigned to any
>> > client so there will be an edge between every task and every client.
>> > The total complexity would be O(T * N) if we want to be more specific.
>>
>> I feel like I'm missing something here, but if E = T * N and the
>> complexity is ~O(E^2), doesn't this make the total complexity order of
>> O(T^2 * N^2)?
>>
>> 3.
>>
>> > Since 3.C.I and 3.C.II have different tradeoffs and work better in
>> > different workloads etc, we could add an internal configuration to
>> > choose one of them at runtime.
>>
>> Why only an internal configuration? Same goes for
>> internal.rack.aware.assignment.standby.strategry (which also has the
>> typo)
>>
>> 4.
>>
>> > There are no changes in public interfaces.
>>
>> I think it would be good to explicitly call out that users can utilize
>> this new feature by setting the ConsumerConfig's CLIENT_RACK_CONFIG,
>> possibly with a brief example
>>
>> 5.
>>
>> > The idea is that if we always try to make it overlap as much with
>> > HAAssignor’s target assignment, at least there’s a higher chance that
>> > tasks won’t be shuffled a lot if the clients remain the same across
>> > rebalances.
>>
>> This line definitely gave me some pause -- if there was one major
>> takeaway I had after KIP-441, one thing that most limited the feature's
>> success, it was our assumption that clients are relatively stable across
>> rebalances. This was mostly true at limited scale or for on-prem setups,
>> but unsurprisingly broke down in cloud environments or larger clusters.
>> Not only do clients naturally fall in and out of the group, autoscaling
>> is becoming more and more of a thing.
>>
>> Lastly, and this is more easily solved but still worth calling out, an
>> assignment is only deterministic as long as the client.id is persisted.
>> Currently in Streams, we only write the process UUID to the state
>> directory if there is one, ie if at least one persistent stateful task
>> exists in the topology. This made sense in the context of KIP-441, which
>> targeted heavily stateful deployments, but this KIP presumably intends
>> to target more than just the persistent & stateful subset of
>> applications. To make matters even worse, "persistent" is defined in a
>> semantically inconsistent way throughout Streams.
>>
>> All this is to say, it may sound more complicated to remember the
>> previous assignment, but (a) imo it only introduces a lot more
>> complexity and shaky assumptions to continue down this path, (b) we
>> actually already do persist some amount of state, like the process
>> UUID, and (c) it seems like this is the perfect opportunity to finally
>> rid ourselves of the determinism constraint, which has frankly caused
>> more trouble and time lost in sum than it would have taken us to just
>> write the HighAvailabilityTaskAssignor to consider the previous
>> assignment from the start in KIP-441.
>>
>> 6.
>>
>> > StickyTaskAssignor users who would like to use rack aware assignment
>> > should upgrade their Kafka Streams version to the version in which
>> > HighAvailabilityTaskAssignor and rack awareness assignment are
>> > available.
>>
>> Building off of the above, the HAAssignor hasn't worked out perfectly
>> for everybody up until now, given that we are only adding complexity to
>> it now; on the flipside, I would hesitate to try and force everyone to
>> use it if they want to upgrade. We added a "secret" backdoor internal
>> config to allow users to set the task assignor back in KIP-441 for this
>> reason. WDYT about bumping this to a public config on the side in this
>> KIP?
>>
>>
>> On Tue, May 23, 2023 at 11:46 AM Hao Li <h...@confluent.io.invalid> wrote:
>>
>> > Thanks John! Yeah. The ConvergenceTest looks very helpful. I will add
>> > it to the test plan. I will also add tests to verify the new optimizer
>> > will produce a balanced assignment which has no worse cross-AZ cost
>> > than the existing assignor.
>> >
>> > Hao
>> >
>> > On Mon, May 22, 2023 at 3:39 PM John Roesler <vvcep...@apache.org>
>> > wrote:
>> >
>> > > Hi Hao,
>> > >
>> > > Thanks for the KIP!
>> > >
>> > > Overall, I think this is a great idea.
I always wanted to circle back
>> > > after the Smooth Scaling KIP to put a proper optimization algorithm
>> > > into place. I think this has the promise to really improve the
>> > > quality of the balanced assignments we produce.
>> > >
>> > > Thanks for providing the details about the MaxCut/MinFlow algorithm.
>> > > It seems like a good choice to me, assuming we choose the right
>> > > scaling factors for the weights we add to the graph. Unfortunately,
>> > > I don't think there's a good way to see how easy or hard this is
>> > > going to be until we actually implement it and test it.
>> > >
>> > > That leads to the only real piece of feedback I had on the KIP,
>> > > which is the testing portion. You mentioned system/integration/unit
>> > > tests, but there's not too much information about what those tests
>> > > will do. I'd like to suggest that we invest in more simulation
>> > > testing specifically, similar to what we did in
>> > > https://github.com/apache/kafka/blob/trunk/streams/src/test/java/org/apache/kafka/streams/processor/internals/assignment/TaskAssignorConvergenceTest.java
>> > > .
>> > >
>> > > In fact, it seems like we _could_ write the simulation up front, and
>> > > then implement the algorithm in a dummy way and just see whether it
>> > > passes the simulations or not, before actually integrating it with
>> > > Kafka Streams.
>> > >
>> > > Basically, I'd be +1 on this KIP today, but I'd feel more confident
>> > > about it if we had a little more detail regarding how we are going
>> > > to verify that the new optimizer is actually going to produce more
>> > > optimal plans than the existing assignor we have today.
>> > >
>> > > Thanks again!
>> > > -John
>> > >
>> > > On 2023/05/22 16:49:22 Hao Li wrote:
>> > > > Hi Colt,
>> > > >
>> > > > Thanks for the comments.
>> > > >
>> > > > > and I struggle to see how the algorithm isn't at least O(N)
>> > > > > where N is the number of Tasks...?
>> > > >
>> > > > For O(E^2 * (CU)) complexity, C and U can be viewed as constant.
>> > > > The number of edges E is T * N where T is the number of clients
>> > > > and N is the number of Tasks. This is because a task can be
>> > > > assigned to any client, so there will be an edge between every
>> > > > task and every client. The total complexity would be O(T * N) if
>> > > > we want to be more specific.
>> > > >
>> > > > > But if the leaders for each partition are spread across multiple
>> > > > > zones, how will you handle that?
>> > > >
>> > > > This is what the min-cost flow solution is trying to solve, i.e.
>> > > > find an assignment of tasks to clients where cross-AZ traffic is
>> > > > minimized. But there are some constraints on the solution, and one
>> > > > of them is that we need to balance the task assignment first (
>> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-925%3A+Rack+aware+task+assignment+in+Kafka+Streams#KIP925:RackawaretaskassignmentinKafkaStreams-Designforrackawareassignment
>> > > > ). So in your example of three tasks' partitions being in the same
>> > > > AZ as a client, if there are other clients, we still want to
>> > > > balance the tasks across other clients, even if putting all tasks
>> > > > on a single client could result in 0 cross-AZ traffic. In the
>> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-925%3A+Rack+aware+task+assignment+in+Kafka+Streams#KIP925:RackawaretaskassignmentinKafkaStreams-Algorithm
>> > > > section, the algorithm will try to find a min-cost solution based
>> > > > on a balanced assignment instead of pure min-cost.
>> > > >
>> > > > Thanks,
>> > > > Hao
>> > > >
>> > > > On Tue, May 9, 2023 at 5:55 PM Colt McNealy <c...@littlehorse.io>
>> > > > wrote:
>> > > >
>> > > > > Hello Hao,
>> > > > >
>> > > > > First of all, THANK YOU for putting this together.
I had been hoping
>> > > > > someone might bring something like this forward. A few comments:
>> > > > >
>> > > > > **1: Runtime Complexity
>> > > > > > Klein’s cycle canceling algorithm can solve the min-cost flow
>> > > > > > problem in O(E^2CU) time where C is max cost and U is max
>> > > > > > capacity. In our particular case, C is 1 and U is at most 3 (A
>> > > > > > task can have at most 3 topics including changelog topic?). So
>> > > > > > the algorithm runs in O(E^2) time for our case.
>> > > > >
>> > > > > A Task can have multiple input topics, and also multiple state
>> > > > > stores, and multiple output topics. The most common case is
>> > > > > three topics as you described, but this is not necessarily
>> > > > > guaranteed. Also, math is one of my weak points, but to me
>> > > > > O(E^2) is equivalent to O(1), and I struggle to see how the
>> > > > > algorithm isn't at least O(N) where N is the number of Tasks...?
>> > > > >
>> > > > > **2: Broker-Side Partition Assignments
>> > > > > Consider the case with just three topics in a Task (one input,
>> > > > > one output, one changelog). If all three partition leaders are
>> > > > > in the same Rack (or better yet, on the same broker), then we
>> > > > > could get massive savings by assigning the Task to that
>> > > > > Rack/availability zone. But if the leaders for each partition
>> > > > > are spread across multiple zones, how will you handle that? Is
>> > > > > that outside the scope of this KIP, or is it worth introducing a
>> > > > > kafka-streams-generate-rebalance-proposal.sh tool?
>> > > > >
>> > > > > Colt McNealy
>> > > > > *Founder, LittleHorse.io*
>> > > > >
>> > > > >
>> > > > > On Tue, May 9, 2023 at 4:03 PM Hao Li <h...@confluent.io.invalid>
>> > > > > wrote:
>> > > > >
>> > > > > > Hi all,
>> > > > > >
>> > > > > > I have submitted KIP-925 to add rack awareness logic to task
>> > > > > > assignment in Kafka Streams and would like to start a
>> > > > > > discussion:
>> > > > > >
>> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-925%3A+Rack+aware+task+assignment+in+Kafka+Streams
>> > > > > >
>> > > > > > --
>> > > > > > Thanks,
>> > > > > > Hao
>> > > > >
>> > > >
>> > > > --
>> > > > Thanks,
>> > > > Hao
>> > >
>> >
>> > --
>> > Thanks,
>> > Hao
>
> --
> Thanks,
> Hao

-- 
Thanks,
Hao
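To ground the complexity discussion in the thread (unit cost C = 1, capacity U bounded by the number of task topics), here is a toy sketch of the cost the min-cost-flow formulation would put on a single task-to-client edge. The topic and rack names are made up for illustration; the real assignor derives racks from partition leader metadata and the consumer's client.rack:

```java
import java.util.Arrays;
import java.util.List;

public class CrossRackCostSketch {
    // One unit of cost per task topic whose partition leader lives in a
    // different rack than the candidate client. This mirrors the KIP's
    // unit cost C = 1 per cross-rack edge; it is a sketch, not the actual
    // assignor code.
    static int crossRackCost(List<String> topicLeaderRacks, String clientRack) {
        int cost = 0;
        for (String rack : topicLeaderRacks) {
            if (!rack.equals(clientRack)) {
                cost++;
            }
        }
        return cost;
    }

    public static void main(String[] args) {
        // Hypothetical task with an input, changelog, and output topic
        // whose leaders span two racks.
        List<String> task = Arrays.asList("rack-a", "rack-a", "rack-b");
        System.out.println(crossRackCost(task, "rack-a")); // 1 cross-rack edge
        System.out.println(crossRackCost(task, "rack-b")); // 2 cross-rack edges
    }
}
```

The min-cost step then minimizes the sum of these edge costs over all tasks, subject to the balance constraint described in the KIP (tasks are spread across clients first, and cost is minimized only among balanced assignments).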