Stephan Ewen created FLINK-22749:
------------------------------------

             Summary: Apply a robust default State Backend Configuration
                 Key: FLINK-22749
                 URL: https://issues.apache.org/jira/browse/FLINK-22749
             Project: Flink
          Issue Type: Improvement
          Components: Stateful Functions
            Reporter: Stephan Ewen


We should update the default state backend configuration with default settings 
that reflect lessons-learned about robust setups.

(1) Always use the RocksDB State Backend. That is already the case.

(2) Active Partitioned Index filters by default. This may cost some overhead in 
specific cases, but helps with massive performance regressions when we have too 
many ColumnFamilies (too many states) such that the cache can no longer hold 
all index files.

We need to add {{state.backend.rocksdb.memory.partitioned-index-filters: true}} 
to the config.

See FLINK-20496 for details.
There is a chance that Flink makes this the default in the future as well, then 
we could remove it again from the StateFun setup.

(3) Activate local recovery by default.

That should speed up the recovery of all non-hard-crashed TMs by a lot.
We need to configure
  - {{state.backend.local-recovery: true}}
  - {{taskmanager.state.local.root-dirs}} to some non-temp directory

For this to work reliably, we need a local directory that is not periodically 
wiped by the OS, so we should not rely on the default ({{/tmp}} directory, but 
set up a dedicated non-temp state directory.

Flink will probably make this the default in the future, but having a 
non-{{/tmp}} directory for the RocksDB and local snapshots makes still a lot of 
sense.

(4) Increase state transfer threads by default, to speed up state restores.

Add to the config: {{state.backend.rocksdb.checkpoint.transfer.thread.num: 8}}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to