Hi,

From my understanding of the code [1], slot selection first considers where the task's state was previously located, and only falls back to the evenly-spread-out scheduling strategy when no previous allocation is known. So, as I read the code, local recovery should take precedence over the evenly-spread-out strategy.
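To illustrate the decision order I mean, here is a minimal sketch in the spirit of the linked class. This is not the actual Flink 1.10 source; the type and method names (SlotSelector, selectSlot, previousAllocationIds, ...) are illustrative assumptions:

    import java.util.Optional;

    // Sketch: prefer the slot a task ran on before; otherwise delegate
    // to a fallback strategy (e.g. evenly-spread-out when
    // cluster.evenly-spread-out-slots is set).
    public class PreviousAllocationFirstStrategy {

        private final SlotSelector fallback;

        public PreviousAllocationFirstStrategy(SlotSelector fallback) {
            this.fallback = fallback;
        }

        public Optional<Slot> selectSlot(SlotProfile profile, Iterable<Slot> freeSlots) {
            // 1) Prefer a free slot whose allocation ID matches one the task
            //    previously ran on, so local state on that TaskManager is reusable.
            for (Slot slot : freeSlots) {
                if (profile.previousAllocationIds().contains(slot.allocationId())) {
                    return Optional.of(slot);
                }
            }
            // 2) No previous allocation available: use the fallback strategy.
            return fallback.selectSlot(profile, freeSlots);
        }

        // Minimal supporting types so the sketch is self-contained.
        public interface SlotSelector {
            Optional<Slot> selectSlot(SlotProfile profile, Iterable<Slot> freeSlots);
        }
        public interface Slot { String allocationId(); }
        public interface SlotProfile { java.util.Set<String> previousAllocationIds(); }
    }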
If you can easily test it, I would still recommend removing the "cluster.evenly-spread-out-slots" strategy, just to make sure my understanding is really correct (see the config sketch at the bottom of this mail).

I don't think that's the case, but just to make sure: you are only restarting a single TaskManager, right? The other TaskManagers keep running? (Afaik the state information is lost if a TaskManager restarts.)

Sorry that I don't have a real answer here (yet). Is there anything suspicious in the JobManager or TaskManager logs?

[1] https://github.com/apache/flink/blob/release-1.10/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/slotpool/PreviousAllocationSlotSelectionStrategy.java

On Wed, Sep 8, 2021 at 9:44 PM Xiang Zhang <xi...@stripe.com> wrote:

> We also have this configuration set in case it makes any difference when
> allocating tasks: cluster.evenly-spread-out-slots.
>
> On 2021/09/08 18:09:52, Xiang Zhang <xi...@stripe.com> wrote:
> > Hello,
> >
> > We have an app running on Flink 1.10.2 deployed in standalone mode. We
> > enabled task-local recovery by setting both *state.backend.local-recovery*
> > and *state.backend.rocksdb.localdir*. The app has over 100 task managers
> > and 2 job managers (active and passive).
> >
> > This is what we have observed. When we restarted a task manager, all
> > tasks got canceled (due to the default failover configuration
> > <https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/execution/task_failure_recovery/#failover-strategies>).
> > Then these tasks were redistributed among the task managers (e.g. some
> > task managers had more slots in use than before the restart). This caused
> > all task managers to download state from remote storage all over again.
> >
> > The same thing happened when we restarted a job manager. The job manager
> > failed over to the passive one successfully, however all tasks were
> > canceled and reallocated among the task managers again.
> >
> > My understanding is that if task-local recovery is enabled, Flink will
> > try to enable sticky assignment of tasks to the previous task managers
> > they ran on. This doesn't seem to be the case. My question is how we can
> > enable this allocation-preserving scheduling
> > <https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/state/large_state_tuning/#allocation-preserving-scheduling>
> > when Flink handles failures.
> >
> > Thanks,
> > Xiang
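For the test I suggested above, a minimal flink-conf.yaml sketch. The two local-recovery options are the ones from your mail; the localdir path is a placeholder, and the change to try is simply removing (or commenting out) the spread-out option:

    # keep task-local recovery enabled
    state.backend.local-recovery: true
    state.backend.rocksdb.localdir: /tmp/flink-local-recovery   # placeholder path

    # temporarily disable for the test by removing/commenting out:
    # cluster.evenly-spread-out-slots: true

If stickiness then works after a single-TaskManager restart, we know the interaction between the two settings is the problem.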