tzulitai edited a comment on pull request #13761: URL: https://github.com/apache/flink/pull/13761#issuecomment-716945671
@pnowojski since the root cause of this issue is that the timer services incorrectly assume that whatever is written in raw keyed state was written by them (please see details in the description of this PR), the ideal solution is to include, as metadata in checkpoints / savepoints, a header indicating what was used to write to raw keyed state. This way, the timer service would know it can safely skip restoring from raw keyed state if the state wasn't written by it (there is only ever one writer to raw keyed state streams).

However, we decided not to go with that approach because:
- Adding such a header would require some backwards compatibility path for savepoint formats
- Raw keyed state is not intended or advertised to be used by users at the moment. Moreover, if some user is really using raw keyed state right now, restoring from checkpoints would have always failed due to this issue.
- In the long term, the heap-based timers should eventually be moved to the state backends as well and no longer use raw keyed state anyway.

That's why we came up with this temporary workaround, with a flag that we expect power users to set if they are using raw keyed state. Since the Stateful Functions project bumped into this, and this is the first time the issue has been reported, we expect that StateFun is currently the only Flink user that uses raw keyed state and needs to set this flag.

As an alternative to the `isUsingCustomRawKeyedState()` method in this PR, I also considered a configuration flag, say `state.backend.rocksdb.migrate-timers`, to provide the exact same functionality across all operators in a job. I chose to go with `isUsingCustomRawKeyedState()` so that:
- the flag is set closer to where it is needed
- only operators that are using custom raw keyed state should skip timer restores
- otherwise, using the global config flag, _all_ operators will either try to skip or read from raw keyed state.

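
To illustrate the per-operator flag, here is a minimal, simplified sketch. Only the method name `isUsingCustomRawKeyedState()` comes from this PR; `RawKeyedStateSketch`, `SketchOperator`, `CustomRawStateOperator`, and `restoreTimers` are hypothetical stand-ins and not Flink classes, and the actual restore path in Flink is more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified, hypothetical model of the per-operator flag; none of these types are Flink classes.
class RawKeyedStateSketch {

    /** Stand-in for an operator base class. Only the method name is taken from the PR. */
    static class SketchOperator {
        // Default: any raw keyed state present is assumed to hold timers.
        protected boolean isUsingCustomRawKeyedState() {
            return false;
        }
    }

    /** Stand-in for an operator that writes its own data to raw keyed state. */
    static class CustomRawStateOperator extends SketchOperator {
        @Override
        protected boolean isUsingCustomRawKeyedState() {
            return true;
        }
    }

    /**
     * Returns the timers read from raw keyed state, or an empty list when the
     * operator declares that it owns the raw keyed state itself.
     */
    static List<String> restoreTimers(SketchOperator operator, List<String> rawKeyedState) {
        if (operator.isUsingCustomRawKeyedState()) {
            // Raw keyed state was not written by the timer service: skip it.
            return new ArrayList<>();
        }
        return new ArrayList<>(rawKeyedState);
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("timer@100", "timer@200");
        System.out.println(restoreTimers(new SketchOperator(), raw));         // prints [timer@100, timer@200]
        System.out.println(restoreTimers(new CustomRawStateOperator(), raw)); // prints []
    }
}
```

Because the decision lives on the operator, other operators in the same job keep the default behavior, which is the difference from a job-wide configuration flag.
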
Either way, this is meant as an undocumented internal flag, expected to be used only by StateFun.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org