The source of the issue may be this error that occurred when the job was being canceled on June 5:
June 5th 2018, 14:59:59.430 Failure during cancellation of job c59dd3133b1182ce2c05a5e2603a0646 with savepoint.
java.io.IOException: Failed to create savepoint directory at --checkpoint-dir
    at org.apache.flink.runtime.checkpoint.savepoint.SavepointStore.createSavepointDirectory(SavepointStore.java:106)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.triggerSavepoint(CheckpointCoordinator.java:376)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:577)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:38)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
    at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
    at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:122)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

On Wed, Jun 20, 2018 at 9:31 AM Elias Levy <fearsome.lucid...@gmail.com> wrote:

> We had an unusual situation last night. One of our Flink clusters
> experienced some connectivity issues, which led to the single job
> running on the cluster failing and then being restored.
>
> And then something odd happened. The cluster decided to also restore an
> old version of the job, one we were running a month ago. That job was
> canceled on June 5 with a savepoint:
>
> June 5th 2018, 15:00:43.865 Trying to cancel job c59dd3133b1182ce2c05a5e2603a0646 with savepoint to s3://bucket/flink/foo/savepoints
> June 5th 2018, 15:00:44.438 Savepoint stored in s3://bucket/flink/foo/savepoints/savepoint-c59dd3-f748765c67df. Now cancelling c59dd3133b1182ce2c05a5e2603a0646.
> June 5th 2018, 15:00:44.438 Job IOC Engine (c59dd3133b1182ce2c05a5e2603a0646) switched from state RUNNING to CANCELLING.
> June 5th 2018, 15:00:44.495 Job IOC Engine (c59dd3133b1182ce2c05a5e2603a0646) switched from state CANCELLING to CANCELED.
> June 5th 2018, 15:00:44.507 Removed job graph c59dd3133b1182ce2c05a5e2603a0646 from ZooKeeper.
> June 5th 2018, 15:00:44.508 Removing /flink/foo/checkpoints/c59dd3133b1182ce2c05a5e2603a0646 from ZooKeeper
> June 5th 2018, 15:00:44.732 Job c59dd3133b1182ce2c05a5e2603a0646 has been archived at s3://bucket/flink/foo/archive/c59dd3133b1182ce2c05a5e2603a0646.
>
> But then yesterday:
>
> June 19th 2018, 17:55:31.917 Attempting to recover job c59dd3133b1182ce2c05a5e2603a0646.
> June 19th 2018, 17:55:32.155 Recovered SubmittedJobGraph(c59dd3133b1182ce2c05a5e2603a0646, JobInfo(clients: Set((Actor[akka.tcp://fl...@ip-10-201-11-121.eu-west-1.compute.internal:42823/temp/$c],DETACHED)), start: 1524514537697)).
> June 19th 2018, 17:55:32.157 Submitting job c59dd3133b1182ce2c05a5e2603a0646 (Some Job) (Recovery).
> June 19th 2018, 17:55:32.157 Using restart strategy FixedDelayRestartStrategy(maxNumberRestartAttempts=2147483647, delayBetweenRestartAttempts=30000) for c59dd3133b1182ce2c05a5e2603a0646.
> June 19th 2018, 17:55:32.157 Submitting recovered job c59dd3133b1182ce2c05a5e2603a0646.
> June 19th 2018, 17:55:32.158 Running initialization on master for job Some Job (c59dd3133b1182ce2c05a5e2603a0646).
> June 19th 2018, 17:55:32.165 Initialized in '/checkpoints/c59dd3133b1182ce2c05a5e2603a0646'.
> June 19th 2018, 17:55:32.170 Job Some Job (c59dd3133b1182ce2c05a5e2603a0646) switched from state CREATED to RUNNING.
> June 19th 2018, 17:55:32.170 Scheduling job c59dd3133b1182ce2c05a5e2603a0646 (Some Job).
>
> Has anyone seen anything like this? Any ideas what the cause may have been?
>
> I am guessing that the state in ZK or S3 may have been somewhat corrupted
> when the job was previously shut down, and that yesterday, when the cluster
> encountered the networking problems that led to the cancellation and restore
> of the currently running job, the restore logic scanned ZK or S3 looking for
> jobs to restore, came across the old job with its bad state, and decided to
> bring it back to life.
>
> Is there any way to scan ZooKeeper or S3 for such jobs?
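
On the last question: as a rough starting point, something like the sketch below lists the job graph entries that Flink's HA services still track in ZooKeeper; any job ID that shows up there but should no longer be running is what a recovering JobManager will try to resubmit. The ZooKeeper quorum address and the /flink/foo/jobgraphs path are assumptions on my part (the /flink/foo prefix is taken from the /flink/foo/checkpoints path in your logs, and /jobgraphs is the default high-availability.zookeeper.path.jobgraphs sub-path), so adjust both to your configuration.

import org.apache.zookeeper.ZooKeeper;

import java.util.List;

public class ListFlinkJobGraphs {
    public static void main(String[] args) throws Exception {
        // Assumed values -- replace with your high-availability.zookeeper.quorum
        // and <path.root>/<cluster-id>/jobgraphs for this cluster.
        String quorum = "zk-host:2181";
        String jobGraphsPath = "/flink/foo/jobgraphs";

        ZooKeeper zk = new ZooKeeper(quorum, 30000, event -> { });
        try {
            // Each child znode is named after a JobID; these are the jobs the
            // JobManager will attempt to recover after a failover.
            List<String> jobIds = zk.getChildren(jobGraphsPath, false);
            for (String jobId : jobIds) {
                System.out.println(jobId);
            }
        } finally {
            zk.close();
        }
    }
}

As far as I understand, the ZooKeeper entries are only pointers to serialized job graphs stored under high-availability.storageDir in S3, so ZooKeeper is the authoritative place to check first; files in the S3 storage directory that no ZooKeeper entry references should not be picked up by recovery.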