Hello, I've got a problem with our Flink cluster: the JobManager no longer starts up, because it tries to download a non-existent (blob) file from the ZooKeeper storage dir.
We're running Flink 1.8.0 on a Kubernetes cluster and use the Google Cloud Storage connector [1] to store checkpoints, savepoints and ZooKeeper data. When I noticed the JobManager was having problems, it was in a crash loop throwing FileNotFoundExceptions [2]:

    Caused by: java.io.FileNotFoundException: Item not found: some-project-flink-state/recovery/hunch/blob/job_e6ad857af7f09b56594e95fe273e9eff/blob_p-486d68fa98fa05665f341d79302c40566b81034e-306d493f5aa810b5f4f7d8d63f5b18b5. If you enabled STRICT generation consistency, it is possible that the live version is still available but the intended generation is deleted.

I looked in the blob directory, but the only entry I can find is:

    /recovery/hunch/blob/job_1dccee15d84e1d2cededf89758ac2482

I've also tried to poke around in ZooKeeper to see if I could find anything [3], but I don't really know what to look for.

How could this have happened, and how can I recover the job from this situation?

Thanks,
Richard

[1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/batch/connectors.html#using-hadoop-file-system-implementations
[2] https://gist.github.com/Xeli/0321031655e47006f00d38fc4bc08e16
[3] https://gist.github.com/Xeli/04f6d861c5478071521ac6d2c582832a
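P.S. To make the mismatch concrete, here is a small sketch of how I compared the two job IDs (the `gsutil ls` line is simply how I listed the bucket; the `sed` extraction is just for illustration, not something Flink itself does):

```shell
# Path of the blob the JobManager fails to download (from the exception above):
missing="recovery/hunch/blob/job_e6ad857af7f09b56594e95fe273e9eff/blob_p-486d68fa98fa05665f341d79302c40566b81034e-306d493f5aa810b5f4f7d8d63f5b18b5"

# Pull out the job ID the JobManager is asking for:
echo "${missing}" | sed -E 's|.*/job_([0-9a-f]+)/.*|\1|'
# → e6ad857af7f09b56594e95fe273e9eff

# Whereas the only job directory actually present in the bucket is:
# gsutil ls gs://some-project-flink-state/recovery/hunch/blob/
#   .../job_1dccee15d84e1d2cededf89758ac2482/
```

So the job ID referenced via ZooKeeper and the job ID that still has blobs on GCS are two different jobs, which is the part I don't understand.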