Hi,

I’m getting an exception at stop-with-savepoint. The savepoint is still created
but the job fails. I’d like to know what the implications and consequences of
the failure are (the job is configured as exactly-once) and how it can be
avoided. Starting the job from that savepoint appears to work as expected.

Here is the exception:
2024-10-09 17:23:48 org.apache.flink.runtime.JobException: The failure is not recoverable
        at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:155)
        at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getGlobalFailureHandlingResult(ExecutionFailureHandler.java:126)
        at org.apache.flink.runtime.scheduler.DefaultScheduler.handleGlobalFailure(DefaultScheduler.java:328)
        at org.apache.flink.runtime.scheduler.stopwithsavepoint.StopWithSavepointTerminationHandlerImpl.terminateExceptionallyWithGlobalFailover(StopWithSavepointTerminationHandlerImpl.java:178)
        at org.apache.flink.runtime.scheduler.stopwithsavepoint.StopWithSavepointTerminationHandlerImpl.access$500(StopWithSavepointTerminationHandlerImpl.java:53)
        at org.apache.flink.runtime.scheduler.stopwithsavepoint.StopWithSavepointTerminationHandlerImpl$SavepointCreated.onAnyExecutionNotFinished(StopWithSavepointTerminationHandlerImpl.java:235)
        at org.apache.flink.runtime.scheduler.stopwithsavepoint.StopWithSavepointTerminationHandlerImpl.handleAnyExecutionNotFinished(StopWithSavepointTerminationHandlerImpl.java:150)
        at org.apache.flink.runtime.scheduler.stopwithsavepoint.StopWithSavepointTerminationHandlerImpl.handleExecutionsTermination(StopWithSavepointTerminationHandlerImpl.java:111)
        at java.base/java.util.concurrent.CompletableFuture$UniAccept.tryFire(Unknown Source)
        at java.base/java.util.concurrent.CompletableFuture$Completion.run(Unknown Source)
        at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRunAsync$4(PekkoRpcActor.java:451)
        at org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
        at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRunAsync(PekkoRpcActor.java:451)
        at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcMessage(PekkoRpcActor.java:218)
        at org.apache.flink.runtime.rpc.pekko.FencedPekkoRpcActor.handleRpcMessage(FencedPekkoRpcActor.java:85)
        at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleMessage(PekkoRpcActor.java:168)
        at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:33)
        at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:29)
        at scala.PartialFunction.applyOrElse(PartialFunction.scala:127)
        at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126)
        at org.apache.pekko.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:29)
        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175)
        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
        at org.apache.pekko.actor.Actor.aroundReceive(Actor.scala:547)
        at org.apache.pekko.actor.Actor.aroundReceive$(Actor.scala:545)
        at org.apache.pekko.actor.AbstractActor.aroundReceive(AbstractActor.scala:229)
        at org.apache.pekko.actor.ActorCell.receiveMessage(ActorCell.scala:590)
        at org.apache.pekko.actor.ActorCell.invoke(ActorCell.scala:557)
        at org.apache.pekko.dispatch.Mailbox.processMailbox(Mailbox.scala:280)
        at org.apache.pekko.dispatch.Mailbox.run(Mailbox.scala:241)
        at org.apache.pekko.dispatch.Mailbox.exec(Mailbox.scala:253)
        at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
        at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
        at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
Caused by: org.apache.flink.runtime.scheduler.stopwithsavepoint.StopWithSavepointStoppingException: A savepoint has been created at: s3p://bucket/path/to/savepoints/savepoint-f0a4f8-0301aa307ec4, but the corresponding job f0a4f8fdfa0038f7818cdbac1212b681 failed during stopping. The savepoint is consistent, but might have uncommitted transactions. If you want to commit the transaction please restart a job from this savepoint.
        at org.apache.flink.runtime.scheduler.stopwithsavepoint.StopWithSavepointTerminationHandlerImpl.terminateExceptionallyWithGlobalFailover(StopWithSavepointTerminationHandlerImpl.java:169)
        ... 33 more
