[ https://issues.apache.org/jira/browse/FLINK-22420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366530#comment-17366530 ]
Yuan Mei edited comment on FLINK-22420 at 6/21/21, 10:55 AM:
-------------------------------------------------------------

Hey [~trohrmann], thanks for replying!

RPC timeout failures rarely happen; as you can see, this test failure occurred nearly 2 months ago.

What I want to point out is that there are sets of tests (like this one, and some other race-condition tests) that rely on checking the expected number of failures.

My option 2 proposal is, conceptually, a general way to "NOT count" or "ONLY count" certain types of failures. We would configure in the test cluster which types of exceptions are not counted, or are the only ones counted. Even better, we could use a different RestartStrategy.

The question is whether this is worthwhile or necessary in the long term, because overall the changes increase system complexity. That's why I asked the question. But I agree that we can start by increasing the "RPC timeout".


was (Author: ym):
Hey [~trohrmann], thanks for replying!

RPC timeout failures rarely happen; as you can see, this test failure occurred nearly 2 months ago.

What I want to point out is that there are sets of tests (like this one, and some other race-condition tests) that rely on checking the expected number of failures.

My option 2 proposal is, conceptually, a general way to "NOT count" certain types of failures. We would configure in the test cluster which types of exceptions are not counted. Even better, we could use a different RestartStrategy.

The question is whether this is worthwhile or necessary in the long term, because overall the changes increase system complexity. That's why I asked the question. But I agree that we can start by increasing the "RPC timeout".

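To make the option 2 idea from the comment above more concrete, here is a minimal sketch of a failure counter that either ignores or exclusively counts configured exception types. Everything in it is an assumption for illustration: `FailureCounter`, `Mode`, and `record` are hypothetical names and not existing Flink classes, and `java.util.concurrent.TimeoutException` only stands in for whatever exception an RPC timeout actually surfaces as in the test cluster.

```java
import java.util.concurrent.TimeoutException;

/** Hypothetical test helper (not part of Flink): decides which failures count. */
final class FailureCounter {
    enum Mode { NOT_COUNT, ONLY_COUNT }

    private final Mode mode;
    private final Class<?>[] types;
    private int counted;

    FailureCounter(Mode mode, Class<?>... types) {
        this.mode = mode;
        this.types = types.clone();
    }

    /** Records a failure; returns true if it counts towards the expected number. */
    boolean record(Throwable failure) {
        boolean matches = matchesAnyCause(failure);
        // ONLY_COUNT counts matching failures; NOT_COUNT counts everything else.
        boolean counts = (mode == Mode.ONLY_COUNT) == matches;
        if (counts) {
            counted++;
        }
        return counts;
    }

    int counted() {
        return counted;
    }

    private boolean matchesAnyCause(Throwable failure) {
        // Walk the cause chain, since the relevant exception is usually wrapped.
        for (Throwable t = failure; t != null; t = t.getCause()) {
            for (Class<?> type : types) {
                if (type.isInstance(t)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumption: TimeoutException stands in for an RPC timeout failure.
        FailureCounter counter = new FailureCounter(Mode.NOT_COUNT, TimeoutException.class);
        counter.record(new RuntimeException("failover", new TimeoutException("rpc")));
        counter.record(new IllegalStateException("failure induced by the test"));
        System.out.println("counted failures: " + counter.counted());
    }
}
```

With a helper like this, a test that asserts on the expected number of failures would compare against `counted()` rather than the raw number of restarts, so an unrelated RPC timeout would not change the result. Whether this belongs in the test cluster or in a custom RestartStrategy is exactly the open question raised in the comment.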
> UnalignedCheckpointITCase failed
> --------------------------------
>
>                 Key: FLINK-22420
>                 URL: https://issues.apache.org/jira/browse/FLINK-22420
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.14.0
>            Reporter: Guowei Ma
>            Priority: Minor
>              Labels: auto-deprioritized-major, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17052&view=logs&j=34f41360-6c0d-54d3-11a1-0292a2def1d9&t=2d56e022-1ace-542f-bf1a-b37dd63243f2&l=9442
> {code:java}
> Apr 22 14:28:21     at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> Apr 22 14:28:21     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> Apr 22 14:28:21     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> Apr 22 14:28:21     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> Apr 22 14:28:21     at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> Apr 22 14:28:21     at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> Apr 22 14:28:21     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> Apr 22 14:28:21     at akka.actor.ActorCell.invoke(ActorCell.scala:561)
> Apr 22 14:28:21     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> Apr 22 14:28:21     at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> Apr 22 14:28:21     at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> Apr 22 14:28:21     at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> Apr 22 14:28:21     at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> Apr 22 14:28:21     at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> Apr 22 14:28:21     at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Apr 22 14:28:21 Caused by: org.apache.flink.util.FlinkException: An OperatorEvent from an OperatorCoordinator to a task was lost. Triggering task failover to ensure consistency. Event: '[NoMoreSplitEvent]', targetTask: Source: source (1/1) - execution #5
> Apr 22 14:28:21     ... 26 more
> Apr 22 14:28:21
> {code}
> As described in the comment
> https://issues.apache.org/jira/browse/FLINK-21996?focusedCommentId=17326449&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17326449
> we might need to adjust the tests to allow failover.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)