Question about setting up Task-local recovery with a RocksDB state backend

Sonam Mandal Thu, 01 Apr 2021 09:39:06 -0700

Hello,

I've been going through the documentation for task-local recovery and came 
across this 
section<https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html#details-on-task-local-recovery-for-different-state-backends>,
 which says that with incremental checkpoints enabled, task-local 
recovery incurs no additional storage cost. The caveat is that the 
task-local recovery state and all the RocksDB local state must be on a 
single physical device to allow the use of hard links. I wanted to understand 
how to ensure that our RocksDB local state is on the same physical device as 
the task-local recovery data.
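
For reference, this is roughly the kind of check I have in mind. It is only a 
sketch, not something taken from the Flink docs: the helper names same_device 
and can_hard_link are made up for illustration, and it assumes that "same 
physical device" in practice means "same filesystem", which is what a hard 
link needs. It compares the device IDs of two directories and then tries an 
actual hard link between them.

```python
import os
import tempfile


def same_device(path_a: str, path_b: str) -> bool:
    """Return True if both paths report the same device ID (st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def can_hard_link(src_dir: str, dst_dir: str) -> bool:
    """Try to hard-link a throwaway file from src_dir into dst_dir."""
    fd, src = tempfile.mkstemp(dir=src_dir)
    os.close(fd)
    dst = os.path.join(dst_dir, os.path.basename(src) + ".link")
    try:
        os.link(src, dst)  # raises OSError (e.g. EXDEV) across filesystems
        os.remove(dst)
        return True
    except OSError:
        return False
    finally:
        os.remove(src)


if __name__ == "__main__":
    # Paths from the config in this mail; adjust to your own mounts.
    rocksdb_dir = "/opt/flink/local-state/rocksdblocaldir"
    local_recovery_dir = "/opt/flink/local-state/tasklocaldir"
    if os.path.isdir(rocksdb_dir) and os.path.isdir(local_recovery_dir):
        print("same device:", same_device(rocksdb_dir, local_recovery_dir))
        print("hard link ok:", can_hard_link(rocksdb_dir, local_recovery_dir))
    else:
        print("directories not found; run this inside the Task Manager pod")
```

Running something like this inside the Task Manager pod should at least show 
whether the two directories sit on one volume mount.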


I came across a couple of config options we can set to point the RocksDB local 
state to a directory of our choosing, along with the task-local recovery 
directory. Do I need to set both for task-local recovery to work correctly? 
What are the default paths if I don't set these configs? (We are using 
Kubernetes; assume that /opt/flink/local-state below corresponds to a given 
physical drive.)


    state.backend.rocksdb.localdir: /opt/flink/local-state/rocksdblocaldir

    taskmanager.state.local.root-dirs: /opt/flink/local-state/tasklocaldir

Do these configs make any difference if we turn off incremental checkpointing 
for RocksDB? Also, setting up this localdir for RocksDB won't affect 
checkpointing and where the checkpoints are stored, right?

After setting up the above two configs, I ran into some issues where the job 
would just disappear (or fail) if the Task Manager pod got killed, whereas 
without these two configs the job resumed correctly from the last checkpoint 
after the Task Manager pod was killed.

Thanks,
Sonam
