[ https://issues.apache.org/jira/browse/FLINK-33565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated FLINK-33565: ----------------------------------- Labels: pull-request-available (was: ) > The concurrentExceptions doesn't work > ------------------------------------- > > Key: FLINK-33565 > URL: https://issues.apache.org/jira/browse/FLINK-33565 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination > Affects Versions: 1.18.0, 1.17.1 > Reporter: Rui Fan > Assignee: Rui Fan > Priority: Major > Labels: pull-request-available > Fix For: 1.19.0 > > Attachments: screenshot-1.png > > > First of all, thanks to [~mapohl] for helping double-check in advance that > this was indeed a bug . > Displaying exception history in WebUI is supported in FLINK-6042. > h1. What's the concurrentExceptions? > When an execution fails due to an exception, other executions in the same > region will also restart, and the first Exception is rootException. If other > restarted executions also report Exception at this time, we hope to collect > these exceptions and Displayed to the user as concurrentExceptions. > h2. What's this bug? > The concurrentExceptions is always empty in production, even if other > executions report exception at very close times. > h1. Why doesn't it work? > If one job has all-to-all shuffle, this job only has one region, and this > region has a lot of executions. If one execution throw exception: > * JobMaster will mark the state as FAILED for this execution. > * The rest of executions of this region will be marked to CANCELING. > ** This call stack can be found at FLIP-364 > [part-4.2.3|https://cwiki.apache.org/confluence/display/FLINK/FLIP-364%3A+Improve+the+restart-strategy#FLIP364:Improvetherestartstrategy-4.2.3Detailedcodeforfull-failover] > > When these executions throw exception as well, it JobMaster will mark the > state from CANCELING to CANCELED instead of FAILED. > The CANCELED execution won't call FAILED logic, so their exceptions are > ignored. > Note: all reports are executed inside of JobMaster RPC thread, it's single > thread. So these reports are executed serially. So only one execution is > marked to FAILED, and the rest of executions will be marked to CANCELED later. > h1. How to fix it? > Offline discuss with [~mapohl] , we need to discuss with community should we > keep the concurrentExceptions first. > * If no, we can remove related logic directly > * If yew, we discuss how to fix it later. -- This message was sent by Atlassian Jira (v8.20.10#820010)