Hello,

We've been seeing an issue with several Flink 1.5.4 clusters that looks
like this:

1. Job is cancelled with a savepoint
2. The jar is deleted from our HA blobstore (S3)
3. The jobgraph in ZK is *not* deleted
4. We restart the cluster
5. Startup fails in recovery because the jar is not available, with the
stacktrace:

00:13:58.486 ERROR o.a.f.r.e.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
    at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36)
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.Exception: Cannot set up the user code libraries: No such file or directory:
s3://streamingplatform-production/{JOB_NAME}/flink/highavailability/{JOB_NAME}/blob/job_5a3fe2c00c05efd3a552a1c6707d2c10/blob_p-6d585831f5c947335ac505b400cf8f3630cc706a-42355c2885b668b0bc5e15b856141b0

This superficially seems similar to several issues that have apparently
been fixed in 1.5.4, like FLINK-10255 and FLINK-10184.

Has anybody else seen this issue on 1.5.4 (or later) clusters, or does
anyone have advice for debugging it?
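
For anyone who wants to check a cluster for the same inconsistency (job
graph still in ZK, jar gone from the blob store), here is a minimal
sketch of the comparison. It is only an illustration under assumptions:
the function names are hypothetical, the ZK job graph node names are
assumed to be the 32-character hex job IDs, and both input lists are
assumed to have been collected separately (for example a listing of the
job graph znode children and a listing of the S3 HA prefix). It does not
talk to ZK or S3 itself.

```python
import re

# Matches the per-job directory seen in the HA blob path from the error,
# e.g. ".../blob/job_<32 hex chars>/blob_p-..."
JOB_DIR = re.compile(r"/blob/job_([0-9a-f]{32})/")


def job_ids_with_blobs(blob_keys):
    """Return the set of job IDs that still have at least one blob key."""
    ids = set()
    for key in blob_keys:
        match = JOB_DIR.search(key)
        if match:
            ids.add(match.group(1))
    return ids


def orphaned_job_graphs(zk_job_ids, blob_keys):
    """Return job IDs that have a job graph in ZK but no blobs left.

    zk_job_ids: job IDs listed under the job graph znode (assumed input).
    blob_keys:  object keys listed under the HA storage prefix (assumed input).
    """
    return sorted(set(zk_job_ids) - job_ids_with_blobs(blob_keys))
```

Calling orphaned_job_graphs(["5a3fe2c00c05efd3a552a1c6707d2c10"], [])
returns that job ID, which is the state described in steps 2 and 3: the
job graph is still registered but no blob for it remains.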

Thanks,
Micah
