Hello Sophie,

Thanks for the proposed KIP. I left some comments on the wiki itself, and I think I'm still not very clear on a couple of those:
1. With this proposal, does that mean that with num.standby.replicas == 0, we may sometimes still have some standby tasks, which would violate the config?

2. I think I understand the rationale for considering lags that are below the specified threshold to be equal, rather than still considering 5000 to be better than 5001 -- we do not want to "over-optimize" and potentially fall into endless rebalances back and forth. But I'm not clear about the rationale behind the second parameter of constrainedBalancedAssignment(StatefulTasksToRankedCandidates, balance_factor): does that mean, e.g. with a balance_factor of 3, we'd consider two assignments, one with a resulting balance_factor of 0 and one with a resulting balance_factor of 3, to be equally optimized and therefore may "stop early"? This was not very convincing to me :P (To make sure I'm reading these two parameters the same way, I've appended a rough sketch at the very bottom of this mail, below the quoted message.)

3. There are a couple of minor comments about the algorithm itself, left on the wiki page since they need to refer to exact lines and are better displayed there.

3.a Another wild thought about the threshold itself: today the assignment is memoryless, so we would not know whether the reported `TaskLag` is increasing or decreasing even if the current value is under the threshold. I wonder if it is worth making it a bit more complicated and tracking the task lag trend at the assignor. In practice it may not be uncommon that standby tasks are not keeping up because other active tasks hosted on the same thread are starving the standby tasks.

4. There's a potential race condition when reporting `TaskLags` in the subscription: right after reporting them to the leader, the cleanup thread kicks in and deletes the state directory. If the task was then assigned to that host, it would have to restore from the beginning, effectively making the seemingly optimized assignment very sub-optimal. To be on the safer side we should consider either pruning out those tasks that are "close to being cleaned up" from the subscription, or delaying the cleanup right after we've included them in the subscription, in case they are selected as assigned tasks by the assignor. (I've also sketched the delayed-cleanup idea at the bottom of this mail.)

5. This is a meta comment: I think it would be helpful to add some user visibility into standby task lag as well, via metrics for example. Today it is hard for us to observe how far behind the active tasks our standby tasks are, and whether that lag is increasing or decreasing. As a follow-up task, for example, a rebalance should also be triggered if we realize that some standby task's lag is increasing indefinitely, meaning it cannot keep up (which is another indicator that either you need to add more resources for the num.standbys, or you are still not balanced enough).


On Tue, Aug 6, 2019 at 1:32 PM Sophie Blee-Goldman <sop...@confluent.io> wrote:

> Hey all,
>
> I'd like to kick off discussion on KIP-441, aimed at the long restore times
> in Streams during which further active processing and IQ are blocked.
> Please give it a read and let us know your thoughts
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-441:+Smooth+Scaling+Out+for+Kafka+Streams
>
> Cheers,
> Sophie
>

--
-- Guozhang
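
P.S. For 2) above, here is a rough sketch of how I am currently reading the threshold and the balance_factor semantics; the class and method names are mine, not from the KIP, so please correct me if this is not the intended reading:

import java.util.Collections;
import java.util.Map;

public final class AssignmentSemanticsSketch {

    // (2a) Lags at or below the acceptable threshold are treated as equal, so
    // e.g. 5000 vs 5001 does not change the ranking of two candidate hosts.
    static int compareLags(final long lagA, final long lagB, final long acceptableLag) {
        final boolean caughtUpA = lagA <= acceptableLag;
        final boolean caughtUpB = lagB <= acceptableLag;
        if (caughtUpA && caughtUpB) {
            return 0;                       // both "caught up": considered equally good
        }
        return Long.compare(lagA, lagB);    // otherwise prefer the smaller lag
    }

    // (2b) My question on the second parameter: is balance_factor a stop
    // condition like this, i.e. with balance_factor = 3, a task-count spread
    // of 0 and a spread of 3 are both "balanced enough" and the algorithm may
    // stop early?
    static boolean balancedEnough(final Map<String, Integer> tasksPerInstance, final int balanceFactor) {
        final int max = Collections.max(tasksPerInstance.values());
        final int min = Collections.min(tasksPerInstance.values());
        return max - min <= balanceFactor;
    }
}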
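
And for 4), a rough sketch of what I meant by delaying the cleanup for state directories we have just reported in the subscription; again this is hypothetical code to illustrate the idea, not the existing cleaner:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public final class CleanupGuardSketch {

    // Task directories that were reported as owned in the last subscription.
    private final Set<String> reportedInSubscription = ConcurrentHashMap.newKeySet();

    // Called while building the subscription, before sending it to the leader.
    void markReported(final String taskDirName) {
        reportedInSubscription.add(taskDirName);
    }

    // Called by the cleanup thread before deleting a state directory: skip
    // anything we just advertised to the assignor.
    boolean safeToDelete(final String taskDirName) {
        return !reportedInSubscription.contains(taskDirName);
    }

    // Called once the rebalance completes and the new assignment is known.
    void clearAfterRebalance() {
        reportedInSubscription.clear();
    }
}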