[ https://issues.apache.org/jira/browse/FLINK-22420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17364758#comment-17364758 ]
Yuan Mei edited comment on FLINK-22420 at 6/17/21, 7:52 AM:
------------------------------------------------------------

The failure of this task is caused by "Recovery is suppressed by FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=5", where the additional failure is triggered by "An OperatorEvent from an OperatorCoordinator to a task was lost", as reported.

The unaligned checkpoint IT case expects five failures in total:

 1. After m=1/4*n, map fails.
 2. After m=1/2*n, snapshotState fails.
 3. After m=3/4*n, map fails and the corresponding recovery fails once.
 4. At the end, close fails once.

It is a bit tricky to allow for additional failures caused by OperatorEvent loss, because the success of this test relies on how many failures occur while it runs.

An easy short-term fix that I can think of is to remove the "maxNumberRestartAttempts=5" constraint, with a bit of concern that it may hide other problems.


was (Author: ym):
The failure of this task is caused by "Recovery is suppressed by FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=5", where the additional failure is triggered by "An OperatorEvent from an OperatorCoordinator to a task was lost", as reported.

The unaligned checkpoint IT case expects five failures in total:

 1. After m=1/4*n, map fails.
 2. After m=1/2*n, snapshotState fails.
 3. After m=3/4*n, map fails and the corresponding recovery fails once.
 4. At the end, close fails once.

It is a bit tricky to allow for additional failures caused by OperatorEvent loss.

An easy short-term fix for this is to remove the "maxNumberRestartAttempts=5" constraint, with a bit of concern that it may hide other problems.

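To make the counting argument concrete, here is a minimal sketch of why one extra failure exhausts the budget. It is a simplified model, not Flink's FixedDelayRestartBackoffTimeStrategy: the class RestartBudgetSketch and its methods expectedFailures and survives are hypothetical names, and the only assumption is that every failure consumes one restart attempt.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of a restart budget; not Flink's implementation. */
public class RestartBudgetSketch {

    /** The five failures the IT case injects on purpose, as listed in the comment. */
    static List<String> expectedFailures() {
        List<String> failures = new ArrayList<>();
        failures.add("map fails after m=1/4*n");
        failures.add("snapshotState fails after m=1/2*n");
        failures.add("map fails after m=3/4*n");
        failures.add("recovery after the third failure fails once");
        failures.add("close fails once at the end");
        return failures;
    }

    /**
     * Returns true if every failure can still be followed by a restart.
     * Assumption: each failure consumes exactly one restart attempt, and
     * recovery is suppressed once the budget is used up.
     */
    static boolean survives(List<String> failures, int maxRestartAttempts) {
        int restartsUsed = 0;
        for (String failure : failures) {
            if (restartsUsed >= maxRestartAttempts) {
                return false; // no attempts left: recovery is suppressed
            }
            restartsUsed++;
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> failures = expectedFailures();
        // Exactly five failures fit into a budget of five restarts.
        System.out.println(survives(failures, 5)); // prints true

        // One unplanned failure (a lost OperatorEvent) exceeds the budget.
        failures.add("OperatorEvent lost, task failover triggered");
        System.out.println(survives(failures, 5)); // prints false

        // Lifting the limit lets the job recover, but would also let
        // any other unexpected failure pass unnoticed.
        System.out.println(survives(failures, Integer.MAX_VALUE)); // prints true
    }
}
```

The last call mirrors the proposed short-term fix and also the stated concern: without a limit, the test can no longer tell planned failures from unplanned ones.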
> UnalignedCheckpointITCase failed
> --------------------------------
>
>                 Key: FLINK-22420
>                 URL: https://issues.apache.org/jira/browse/FLINK-22420
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.14.0
>            Reporter: Guowei Ma
>            Priority: Minor
>              Labels: auto-deprioritized-major, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17052&view=logs&j=34f41360-6c0d-54d3-11a1-0292a2def1d9&t=2d56e022-1ace-542f-bf1a-b37dd63243f2&l=9442
> {code:java}
> Apr 22 14:28:21 	at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> Apr 22 14:28:21 	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> Apr 22 14:28:21 	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> Apr 22 14:28:21 	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> Apr 22 14:28:21 	at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> Apr 22 14:28:21 	at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> Apr 22 14:28:21 	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> Apr 22 14:28:21 	at akka.actor.ActorCell.invoke(ActorCell.scala:561)
> Apr 22 14:28:21 	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> Apr 22 14:28:21 	at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> Apr 22 14:28:21 	at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> Apr 22 14:28:21 	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> Apr 22 14:28:21 	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> Apr 22 14:28:21 	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> Apr 22 14:28:21 	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Apr 22 14:28:21 Caused by: org.apache.flink.util.FlinkException: An OperatorEvent from an OperatorCoordinator to a task was lost. Triggering task failover to ensure consistency. Event: '[NoMoreSplitEvent]', targetTask: Source: source (1/1) - execution #5
> Apr 22 14:28:21 	... 26 more
> Apr 22 14:28:21
> {code}
> As described in the comment
> https://issues.apache.org/jira/browse/FLINK-21996?focusedCommentId=17326449&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17326449
> we might need to adjust the tests to allow failover.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)