Tim Patterson created KAFKA-13600:
-------------------------------------

             Summary: Rebalances while Streams is in a degraded state can cause 
stores to be assigned and restored from scratch
                 Key: KAFKA-13600
                 URL: https://issues.apache.org/jira/browse/KAFKA-13600
             Project: Kafka
          Issue Type: Bug
          Components: streams
    Affects Versions: 3.0.0, 2.8.1
            Reporter: Tim Patterson


Consider this scenario:
 # A node is lost from the cluster.
 # A rebalance is kicked off with a new "target assignment" (i.e. the rebalance 
is attempting to move a lot of tasks - see 
https://issues.apache.org/jira/browse/KAFKA-10121).
 # The Kafka cluster is now a bit more sluggish from the increased load.
 # A rolling deploy happens, triggering rebalances. During the rebalances 
processing continues but offsets can't be committed (or nodes are restarted but 
fail to commit offsets).
 # The most caught up nodes now aren't within `acceptableRecoveryLag`, so the 
task is started in its "target assignment" location, restoring all state 
from scratch and delaying further processing instead of using the "almost 
caught up" node.

We've hit this a few times. With a lot of state (~25 TB worth) and as heavy 
users of IQ, this is not ideal for us.

While we can increase `acceptableRecoveryLag` to larger values to try to get 
around this, that causes other issues (i.e. a warmup becoming active when it's 
still quite far behind).
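For reference, the knob being discussed corresponds to the Streams config 
`acceptable.recovery.lag`. A minimal sketch of raising it (the value is 
illustrative only, not a recommendation):

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class RecoveryLagConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Default is 10,000 records; raising it (illustrative value) lets more
        // "slightly behind" clients count as caught up, but a warmup can then
        // be promoted to active while still well behind the changelog end.
        props.put(StreamsConfig.ACCEPTABLE_RECOVERY_LAG_CONFIG, 1_000_000L);
    }
}
```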



The solution seems to be to balance "balanced assignment" with "most caught up 
nodes".

We've got a fork where we do just this and it's made a huge difference to the 
reliability of our cluster.

Our change is simply to use the most caught up node if the "target node" is 
more than `acceptableRecoveryLag` behind.
This gives up some of the load-balancing behaviour of the existing code, 
but in practice that doesn't seem to matter too much.
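To make the rule concrete, here is a minimal, hypothetical sketch of that 
decision (this is not the code from our fork; `chooseClient`, `clientLags`, 
and the node names are invented for illustration):

```java
import java.util.Comparator;
import java.util.Map;

public class MostCaughtUpSelection {

    /**
     * Pick the client to own a stateful task. If the balance-driven target
     * client is within acceptableRecoveryLag of the task's changelog end
     * offset, keep it; otherwise fall back to the most caught up client.
     */
    static String chooseClient(String targetClient,
                               Map<String, Long> clientLags, // client -> lag on this task's changelog
                               long acceptableRecoveryLag) {
        long targetLag = clientLags.getOrDefault(targetClient, Long.MAX_VALUE);
        if (targetLag <= acceptableRecoveryLag) {
            return targetClient; // target is caught up enough, keep the balanced placement
        }
        // Otherwise use whichever client is most caught up, avoiding a restore from scratch.
        return clientLags.entrySet().stream()
                .min(Comparator.comparingLong(Map.Entry::getValue))
                .map(Map.Entry::getKey)
                .orElse(targetClient);
    }

    public static void main(String[] args) {
        Map<String, Long> lags = Map.of("nodeA", 500L, "nodeB", 2_000_000L);
        // Target "nodeB" is far behind, so the task stays on "nodeA" instead
        // of restoring all of its state from the changelog.
        System.out.println(chooseClient("nodeB", lags, 10_000L)); // prints nodeA
    }
}
```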

Perhaps an algorithm that identified candidate nodes as those within 
`acceptableRecoveryLag` of the most caught up node would allow the best 
of both worlds; see the sketch below.
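As a rough sketch of that idea (again hypothetical; `candidateClients` and the 
node names are invented), the candidate set could then be handed to the 
existing balance logic:

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class CaughtUpCandidates {

    /**
     * Candidate clients for a stateful task: those whose lag is within
     * acceptableRecoveryLag of the most caught up client.
     */
    static Set<String> candidateClients(Map<String, Long> clientLags,
                                        long acceptableRecoveryLag) {
        long bestLag = clientLags.values().stream()
                .mapToLong(Long::longValue)
                .min()
                .orElse(0L);
        return clientLags.entrySet().stream()
                .filter(e -> e.getValue() - bestLag <= acceptableRecoveryLag)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        // nodeA and nodeB are close enough to each other to both be candidates;
        // nodeC would have to restore from scratch, so it is excluded.
        Map<String, Long> lags = Map.of("nodeA", 500L, "nodeB", 4_000L, "nodeC", 2_000_000L);
        System.out.println(candidateClients(lags, 10_000L));
    }
}
```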

 

Our fork is

[https://github.com/apache/kafka/compare/trunk...tim-patterson:fix_balance_uncaughtup?expand=1]
(We also moved the capacity constraint code to happen after all the stateful 
assignment to prioritise standby tasks over warmup tasks)

Ideally we don't want to maintain a fork of Kafka Streams going forward, so 
we're hoping to get a bit of discussion / agreement on the best way to handle 
this.
We're more than happy to contribute code, test different algorithms in our 
production system, or do anything else to help with this issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
