[jira] [Comment Edited] (FLINK-22420) UnalignedCheckpointITCase failed

Yuan Mei (Jira) Mon, 21 Jun 2021 03:56:27 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-22420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366530#comment-17366530
 ]


Yuan Mei edited comment on FLINK-22420 at 6/21/21, 10:55 AM:
-------------------------------------------------------------

Hey [~trohrmann], thanks for replying!

RPC timeout failures rarely happen; as you can see, this test failure occurred 
nearly 2 months ago. 

What I want to point out is that there are some sets of tests (like this one, 
and some other race condition tests) that rely on checking for an expected 
number of failures.

My proposal (option 2) is, conceptually, a general way to "NOT count" or "ONLY 
count" certain types of failures. We would configure in the test cluster which 
types of exceptions are excluded from the count, or are the only ones counted. 
Even better, we could use a different RestartStrategy.
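
To make the "NOT count" / "ONLY count" idea concrete, here is a rough sketch in 
plain Java. It is not Flink's API: the class FilteredFailureCounter, its Mode 
enum, and the choice to match exception types anywhere in the cause chain are 
all hypothetical and only illustrate the filtering rule. How such a counter 
would be wired into the test cluster or a RestartStrategy is left open.

```java
import java.util.Set;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not part of Flink): counts failures subject to a type filter. */
final class FilteredFailureCounter {

    /** NOT_COUNT ignores the listed types; ONLY_COUNT counts nothing but the listed types. */
    enum Mode { NOT_COUNT, ONLY_COUNT }

    private final Mode mode;
    private final Set<Class<? extends Throwable>> types;
    private int counted;

    FilteredFailureCounter(Mode mode, Set<Class<? extends Throwable>> types) {
        this.mode = mode;
        this.types = Set.copyOf(types);
    }

    /** Records a failure and returns true if it was added to the count. */
    boolean onFailure(Throwable failure) {
        boolean matches = matchesAnyInCauseChain(failure);
        // ONLY_COUNT counts matching failures; NOT_COUNT counts non-matching ones.
        boolean count = (mode == Mode.ONLY_COUNT) == matches;
        if (count) {
            counted++;
        }
        return count;
    }

    int countedFailures() {
        return counted;
    }

    /** Assumption: a failure matches if it, or any of its causes, is one of the listed types. */
    private boolean matchesAnyInCauseChain(Throwable failure) {
        for (Throwable t = failure; t != null; t = t.getCause()) {
            for (Class<? extends Throwable> type : types) {
                if (type.isInstance(t)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

With Mode.NOT_COUNT and TimeoutException in the set, a failure caused by an RPC 
timeout would be ignored, while a failure injected by the test would still 
count towards the expected number of failures.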

The question is whether this is worthwhile or necessary in the long term, 
because overall the changes increase system complexity. That's why I asked.

But I agree that we can start by increasing the "RPC timeout".




was (Author: ym):
Hey [~trohrmann], thanks for replying!

RPC timeout failures rarely happen; as you can see, this test failure occurred 
nearly 2 months ago. 

What I want to point out is that there are some sets of tests (like this one, 
and some other race condition tests) that rely on checking for an expected 
number of failures.

My proposal (option 2) is, conceptually, a general way to "NOT count" certain 
types of failures. We would configure in the test cluster which types of 
exceptions are not counted. Even better, we could use a different 
RestartStrategy.

The question is whether this is worthwhile or necessary in the long term, 
because overall the changes increase system complexity. That's why I asked.

But I agree that we can start by increasing the "RPC timeout".



> UnalignedCheckpointITCase failed
> --------------------------------
>
>                 Key: FLINK-22420
>                 URL: https://issues.apache.org/jira/browse/FLINK-22420
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.14.0
>            Reporter: Guowei Ma
>            Priority: Minor
>              Labels: auto-deprioritized-major, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17052&view=logs&j=34f41360-6c0d-54d3-11a1-0292a2def1d9&t=2d56e022-1ace-542f-bf1a-b37dd63243f2&l=9442
> {code:java}
> Apr 22 14:28:21       at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> Apr 22 14:28:21       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> Apr 22 14:28:21       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> Apr 22 14:28:21       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> Apr 22 14:28:21       at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> Apr 22 14:28:21       at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> Apr 22 14:28:21       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> Apr 22 14:28:21       at akka.actor.ActorCell.invoke(ActorCell.scala:561)
> Apr 22 14:28:21       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> Apr 22 14:28:21       at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> Apr 22 14:28:21       at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> Apr 22 14:28:21       at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> Apr 22 14:28:21       at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> Apr 22 14:28:21       at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> Apr 22 14:28:21       at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Apr 22 14:28:21 Caused by: org.apache.flink.util.FlinkException: An OperatorEvent from an OperatorCoordinator to a task was lost. Triggering task failover to ensure consistency. Event: '[NoMoreSplitEvent]', targetTask: Source: source (1/1) - execution #5
> Apr 22 14:28:21       ... 26 more
> Apr 22 14:28:21 
> {code}
> As described in the comment 
> https://issues.apache.org/jira/browse/FLINK-21996?focusedCommentId=17326449&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17326449
> we might need to adjust the tests to allow failover.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
