Hello,

I've got a problem with our Flink cluster: the JobManager no longer
starts up, because it tries to download a non-existent (blob) file
from the ZooKeeper storage dir.

We're running Flink 1.8.0 on a Kubernetes cluster and use the Google Cloud
Storage connector [1] to store checkpoints, savepoints, and ZooKeeper data.

When I noticed the JobManager was having problems, it was in a crash loop,
throwing FileNotFoundExceptions [2]:
Caused by: java.io.FileNotFoundException: Item not found:
some-project-flink-state/recovery/hunch/blob/job_e6ad857af7f09b56594e95fe273e9eff/blob_p-486d68fa98fa05665f341d79302c40566b81034e-306d493f5aa810b5f4f7d8d63f5b18b5.
If you enabled STRICT generation consistency, it is possible that the live
version is still available but the intended generation is deleted.

I looked in the blob directory and could only find
/recovery/hunch/blob/job_1dccee15d84e1d2cededf89758ac2482. I've tried
poking around in ZooKeeper to see if I could find anything [3], but I do
not really know what to look for.
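For reference, this is roughly how I've been inspecting the ZooKeeper state. The paths below are an assumption based on the default high-availability.zookeeper.path.root of /flink plus our cluster-id "hunch", and the job id is taken from the exception above; the actual layout on our cluster may differ:

```shell
# Connect with the ZooKeeper CLI shipped in the ZooKeeper distribution
# (hostname/port are placeholders for our ZooKeeper service)
bin/zkCli.sh -server zookeeper:2181

# List the Flink HA znodes; assumes the default path root /flink
# and our cluster-id "hunch"
ls /flink/hunch
ls /flink/hunch/jobgraphs
ls /flink/hunch/checkpoints

# Each jobgraph znode holds a serialized state handle pointing into the
# storage dir, i.e. under gs://some-project-flink-state/recovery/hunch/
get /flink/hunch/jobgraphs/e6ad857af7f09b56594e95fe273e9eff
```

I can see znodes there, but I can't tell which entries are supposed to match which files in the storage dir.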

How could this have happened, and how can I recover the job from this
situation?

Thanks,

Richard

[1]
https://ci.apache.org/projects/flink/flink-docs-stable/dev/batch/connectors.html#using-hadoop-file-system-implementations
[2] https://gist.github.com/Xeli/0321031655e47006f00d38fc4bc08e16
[3] https://gist.github.com/Xeli/04f6d861c5478071521ac6d2c582832a
