Hello, We've been seeing an issue with several Flink 1.5.4 clusters that looks like this:
1. Job is cancelled with a savepoint 2. The jar is deleted from our HA blobstore (S3) 3. The jobgraph in ZK is *not* deleted 4. We restart the cluster 5. Startup fails in recovery because the jar is not available, with the stacktrace: 00:13:58.486 ERROR o.a.f.r.e.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint. {{ java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager}} {{ at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36)}} {{ at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)}} {{ at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)}} {{ at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)}} {{ at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)}} {{ at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)}} {{ at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)}} {{ at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)}}Caused by: java.lang.Exception: Cannot set up the user code libraries: No such file or directory: s3://streamingplatform-production/{JOB_NAME}/flink/highavailability/{JOB_NAME}/blob/job_5a3fe2c00c05efd3a552a1c6707d2c10/blob_p-6d585831f5c947335ac505b400cf8f3630cc706a-42355c2885b668b0bc5e15b856141b0 This superficially seems similar to several issues that have apparently been fixed in 1.5.4, like FLINK-10255 and FLINK-10184. Has anybody else seen this issue on 1.5.4 (or later) clusters? Or any advice for debugging? Thanks, Micah