The source of the issue may be this error that occurred when the job was being canceled on June 5:
June 5th 2018, 14:59:59.430 Failure during cancellation of job c59dd3133b1182ce2c05a5e2603a0646 with savepoint.
java.io.IOException: Failed to create savepoint directory at --checkpoint-dir
    at org.apache.flink.runtime.checkpoint.savepoint.SavepointStore.createSavepointDirectory(SavepointStore.java:106)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.triggerSavepoint(CheckpointCoordinator.java:376)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:577)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:38)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
    at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
    at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:122)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

On Wed, Jun 20, 2018 at 9:31 AM Elias Levy <fearsome.lucid...@gmail.com> wrote:

> We had an unusual situation last night. One of our Flink clusters
> experienced some connectivity issues, which led to the single job
> running on the cluster failing and then being restored.
>
> And then something odd happened. The cluster decided to also restore an
> old version of the job, one we were running a month ago. That job was
> canceled on June 5 with a savepoint:
>
> June 5th 2018, 15:00:43.865 Trying to cancel job c59dd3133b1182ce2c05a5e2603a0646 with savepoint to s3://bucket/flink/foo/savepoints
> June 5th 2018, 15:00:44.438 Savepoint stored in s3://bucket/flink/foo/savepoints/savepoint-c59dd3-f748765c67df. Now cancelling c59dd3133b1182ce2c05a5e2603a0646.
> June 5th 2018, 15:00:44.438 Job IOC Engine (c59dd3133b1182ce2c05a5e2603a0646) switched from state RUNNING to CANCELLING.
> June 5th 2018, 15:00:44.495 Job IOC Engine (c59dd3133b1182ce2c05a5e2603a0646) switched from state CANCELLING to CANCELED.
> June 5th 2018, 15:00:44.507 Removed job graph c59dd3133b1182ce2c05a5e2603a0646 from ZooKeeper.
> June 5th 2018, 15:00:44.508 Removing /flink/foo/checkpoints/c59dd3133b1182ce2c05a5e2603a0646 from ZooKeeper
> June 5th 2018, 15:00:44.732 Job c59dd3133b1182ce2c05a5e2603a0646 has been archived at s3://bucket/flink/foo/archive/c59dd3133b1182ce2c05a5e2603a0646.
>
> But then yesterday:
>
> June 19th 2018, 17:55:31.917 Attempting to recover job c59dd3133b1182ce2c05a5e2603a0646.
> June 19th 2018, 17:55:32.155 Recovered SubmittedJobGraph(c59dd3133b1182ce2c05a5e2603a0646, JobInfo(clients: Set((Actor[akka.tcp://fl...@ip-10-201-11-121.eu-west-1.compute.internal:42823/temp/$c],DETACHED)), start: 1524514537697)).
> June 19th 2018, 17:55:32.157 Submitting job c59dd3133b1182ce2c05a5e2603a0646 (Some Job) (Recovery).
> June 19th 2018, 17:55:32.157 Using restart strategy FixedDelayRestartStrategy(maxNumberRestartAttempts=2147483647, delayBetweenRestartAttempts=30000) for c59dd3133b1182ce2c05a5e2603a0646.
> June 19th 2018, 17:55:32.157 Submitting recovered job c59dd3133b1182ce2c05a5e2603a0646.
> June 19th 2018, 17:55:32.158 Running initialization on master for job Some Job (c59dd3133b1182ce2c05a5e2603a0646).
> June 19th 2018, 17:55:32.165 Initialized in '/checkpoints/c59dd3133b1182ce2c05a5e2603a0646'.
> June 19th 2018, 17:55:32.170 Job Some Job (c59dd3133b1182ce2c05a5e2603a0646) switched from state CREATED to RUNNING.
> June 19th 2018, 17:55:32.170 Scheduling job c59dd3133b1182ce2c05a5e2603a0646 (Some Job).
>
> Has anyone seen anything like this? Any ideas what the cause may have been?
>
> I am guessing that the state in ZK or S3 may have been somewhat corrupted
> when the job was previously shut down, and that yesterday, when the cluster
> encountered the networking problems that led to the cancellation and restore
> of the currently running job, the restore logic scanned ZK or S3 looking for
> jobs to restore, came across the old job with its bad state, and decided to
> bring it back to life.
>
> Is there any way to scan ZooKeeper or S3 for such jobs?
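
On the last question: as a rough starting point, something like the sketch below lists the job graph entries that Flink's HA services still track in ZooKeeper; any job ID that shows up there but should no longer be running is what a recovering JobManager will try to resubmit. The ZooKeeper quorum address and the /flink/foo/jobgraphs path are assumptions on my part (the /flink/foo prefix is taken from the /flink/foo/checkpoints path in your logs, and /jobgraphs is the default high-availability.zookeeper.path.jobgraphs sub-path), so adjust both to your configuration.

import org.apache.zookeeper.ZooKeeper;

import java.util.List;

public class ListFlinkJobGraphs {
    public static void main(String[] args) throws Exception {
        // Assumed values -- replace with your high-availability.zookeeper.quorum
        // and <path.root>/<cluster-id>/jobgraphs for this cluster.
        String quorum = "zk-host:2181";
        String jobGraphsPath = "/flink/foo/jobgraphs";

        ZooKeeper zk = new ZooKeeper(quorum, 30000, event -> { });
        try {
            // Each child znode is named after a JobID; these are the jobs the
            // JobManager will attempt to recover after a failover.
            List<String> jobIds = zk.getChildren(jobGraphsPath, false);
            for (String jobId : jobIds) {
                System.out.println(jobId);
            }
        } finally {
            zk.close();
        }
    }
}

As far as I understand, the ZooKeeper entries are only pointers to serialized job graphs stored under high-availability.storageDir in S3, so ZooKeeper is the authoritative place to check first; files in the S3 storage directory that no ZooKeeper entry references should not be picked up by recovery.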