Rui Fan created FLINK-33565:
-------------------------------

             Summary: The concurrentExceptions doesn't work
                 Key: FLINK-33565
                 URL: https://issues.apache.org/jira/browse/FLINK-33565
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.17.1, 1.18.0
            Reporter: Rui Fan
            Assignee: Rui Fan


First of all, thanks to [~mapohl] for helping double-check in advance that this 
was indeed a bug.

Displaying the exception history in the Web UI was added in FLINK-6042.
h1. What's the concurrentExceptions?

When an execution fails due to an exception, the other executions in the same 
region are restarted as well, and the first exception becomes the rootException. 
If other restarted executions also report exceptions at this time, we want to 
collect them and display them to the user as concurrentExceptions.
h2. What's this bug?

The concurrentExceptions is always empty in production, even if other 
executions report exceptions at almost the same time.
h1. Why doesn't it work?

If a job has an all-to-all shuffle, it has only one region, and this region 
contains many executions. If one execution throws an exception:
 * JobMaster marks this execution as FAILED.
 * The remaining executions of this region are marked as CANCELING.
 ** This call stack can be found at FLIP-364 
[part-4.2.3|https://cwiki.apache.org/confluence/display/FLINK/FLIP-364%3A+Improve+the+restart-strategy#FLIP364:Improvetherestartstrategy-4.2.3Detailedcodeforfull-failover]
 

When these executions throw exceptions as well, JobMaster moves them from 
CANCELING to CANCELED instead of FAILED.

A CANCELED execution does not go through the FAILED logic, so its exception is 
ignored.

Note: all reports are handled inside the JobMaster RPC thread, which is a single 
thread, so they are processed serially. As a result, only one execution is 
marked as FAILED, and the remaining executions are marked as CANCELED later.
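
The state transitions above can be illustrated with a small toy model. This is 
only a sketch and not Flink code: the class ConcurrentExceptionsSketch, its 
fields, and the reportFailure method are hypothetical names made up for this 
example, and the real scheduler logic is considerably more involved. It only 
shows why a serial handler that moves CANCELING executions to CANCELED never 
records a concurrent exception.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a single failover region. Hypothetical, not Flink code. */
final class ConcurrentExceptionsSketch {

    enum State { RUNNING, FAILED, CANCELING, CANCELED }

    // Current state of every execution in the region, keyed by a made-up id.
    final Map<String, State> states = new LinkedHashMap<>();
    Throwable rootException;
    final List<Throwable> concurrentExceptions = new ArrayList<>();

    ConcurrentExceptionsSketch(String... executionIds) {
        for (String id : executionIds) {
            states.put(id, State.RUNNING);
        }
    }

    /** Handles one failure report. Calls are serial, as on a single RPC thread. */
    void reportFailure(String executionId, Throwable cause) {
        State current = states.get(executionId);
        if (current == State.RUNNING) {
            // First report: this execution becomes FAILED and provides the root exception.
            states.put(executionId, State.FAILED);
            rootException = cause;
            // Every other running execution of the region is cancelled for the restart.
            states.replaceAll((id, s) -> s == State.RUNNING ? State.CANCELING : s);
        } else if (current == State.CANCELING) {
            // Later report: CANCELING -> CANCELED bypasses the failure path,
            // so `cause` is dropped and never reaches concurrentExceptions.
            states.put(executionId, State.CANCELED);
        }
    }

    public static void main(String[] args) {
        ConcurrentExceptionsSketch region = new ConcurrentExceptionsSketch("e1", "e2", "e3");
        region.reportFailure("e1", new RuntimeException("first"));
        region.reportFailure("e2", new RuntimeException("second"));
        System.out.println(region.states);                         // {e1=FAILED, e2=CANCELED, e3=CANCELING}
        System.out.println(region.rootException.getMessage());     // first
        System.out.println(region.concurrentExceptions.isEmpty()); // true
    }
}
```

In this model the second exception arrives right after the first one, but the 
only branch that could record it is never taken, which matches the empty 
concurrentExceptions seen in production.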
h1. How to fix it?

After an offline discussion with [~mapohl], we first need to discuss with the 
community whether we should keep the concurrentExceptions.
 * If no, we can remove the related logic directly.
 * If yes, we discuss how to fix it later.



