Hi Fabian,

Thanks for collecting feedback. Here are the answers to your questions:

1. Yes, we enabled incremental checkpoints for our job by setting
`state.backend.incremental` to true. As for whether the checkpoint we
recover from is incremental or not, I'm not sure how to determine that.
It's whatever Flink does by default with incremental checkpoints enabled.

2. Yes, this was on purpose; we had tuned our job to work well on SSDs. We
have also run jobs with those parameters unset, using the defaults, and we
still hit the same OOM issues. (A sketch of the relevant settings is below.)
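
For reference, the relevant part of our flink-conf.yaml looks roughly like
the sketch below. The concrete values are illustrative placeholders rather
than our exact production settings, and FLASH_SSD_OPTIMIZED simply stands in
for "tuned for SSDs":

    # RocksDB state backend with incremental checkpoints enabled
    state.backend: rocksdb
    state.backend.incremental: true

    # SSD-oriented tuning; removing these two lines falls back to the
    # values Flink manages automatically (we have tried both setups)
    state.backend.rocksdb.predefined-options: FLASH_SSD_OPTIMIZED
    state.backend.rocksdb.block.cache-size: 64mb   # illustrative value only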

Thanks for the pointer; yes, we've been looking at the RocksDB metrics, but
so far they haven't pointed us to the cause.
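
In case it helps others on the thread, the flink-conf.yaml switches described
on the native-metrics page you linked look roughly like this (a sketch of a
small subset, with key names as I read them from [1]; each flag exposes one
RocksDB property as a Flink metric and adds a bit of overhead):

    # Expose selected RocksDB native properties as Flink metrics
    state.backend.rocksdb.metrics.block-cache-usage: true
    state.backend.rocksdb.metrics.block-cache-capacity: true
    state.backend.rocksdb.metrics.size-all-mem-tables: true
    state.backend.rocksdb.metrics.estimate-table-readers-mem: true
    state.backend.rocksdb.metrics.num-running-compactions: true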

On Wed, Oct 6, 2021 at 3:21 AM Fabian Paul <fabianp...@ververica.com> wrote:

> Hi Kevin,
>
> Sorry for the late reply. I collected some feedback from other folks and
> have two more questions.
>
> 1. Did you enable incremental checkpoints for your job and is the
> checkpoint you recover from incremental?
>
> 2. I saw in your configuration that you set
> `state.backend.rocksdb.block.cache-size` and
> `state.backend.rocksdb.predefined.options`; by doing so you overwrite the
> values Flink sets automatically. Can you confirm that this is on purpose?
> The value for block.cache-size seems to be very small.
>
> You can also enable the native RocksDB metrics [1] to get a more detailed
> view of the RocksDB memory consumption, but be careful because it may
> degrade the performance of your job.
>
> Best,
> Fabian
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#rocksdb-native-metrics
