Hi Kevin,

Sorry for the late reply. I collected some feedback from other folks and have two more questions.

1. Did you enable incremental checkpoints for your job, and is the checkpoint you recover from incremental?

2. I saw in your configuration that you set `state.backend.rocksdb.block.cache-size` and `state.backend.rocksdb.predefined.options`. By doing so, you override the values Flink sets automatically. Can you confirm that this is on purpose? The value for `block.cache-size` seems to be very small.

You can also enable the native RocksDB metrics [1] to get a more detailed view of the RocksDB memory consumption, but be careful because it may degrade the performance of your job.

Best,
Fabian

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#rocksdb-native-metrics