[jira] [Commented] (FLINK-20672) notifyCheckpointAborted RPC failure can fail JM

Zakelly Lan (Jira) Wed, 08 Nov 2023 20:45:05 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-20672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784274#comment-17784274
 ]


Zakelly Lan commented on FLINK-20672:
-------------------------------------

[~yunta] A fatal exit on an uncaught exception is a relatively "safe" option, 
since the executor service may not know what to do when it encounters an error. If 
we could strictly limit its use and stipulate that it must not affect the 
running job, we could use another handler that does not fail the process.

After reading some code, IIUC, the 
{{[DefaultJobMasterServiceFactory|https://github.com/apache/flink/blob/eb4ae5d4e7d517300e98e632de95249dbdd22192/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/JobMasterServiceLeadershipRunnerFactory.java#L101C1-L101C1]}}
 is using the io-executor to 
{{[createJobMasterService|https://github.com/apache/flink/blob/eb4ae5d4e7d517300e98e632de95249dbdd22192/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/factories/DefaultJobMasterServiceFactory.java#L101]}},
 which is essential for running the job. It also leaves all exceptions uncaught, 
which would need to change as well if we decide to change the behavior of the io executor.

Actually, I have no strong preference on whether to change this behavior, since 
some "io" operations may be fatal while most are not. This is a matter of rules and 
contracts. I suggest we do our best to catch exceptions within a 
runnable task when we are sure that task should not have any side effects on the 
job. WDYT?
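
A minimal sketch of that suggestion, assuming a plain fixed-size thread pool whose thread factory installs an uncaught exception handler. This is not Flink code: {{GuardedRunnableSketch}}, {{guarded}} and {{runDemo}} are hypothetical names, and the handler here only counts invocations instead of exiting the JVM.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; none of these names exist in Flink.
class GuardedRunnableSketch {

    /** Wraps a best-effort task so that its exceptions never escape to the worker thread. */
    static Runnable guarded(Runnable task, AtomicInteger swallowed) {
        return () -> {
            try {
                task.run();
            } catch (Exception e) {
                // Only Exception is caught; Errors (e.g. OutOfMemoryError) still propagate.
                swallowed.incrementAndGet();
            }
        };
    }

    /** Returns {swallowed, uncaught}: failures absorbed by the wrapper vs. seen by the handler. */
    static int[] runDemo() {
        AtomicInteger swallowed = new AtomicInteger();
        AtomicInteger uncaught = new AtomicInteger();
        CountDownLatch handlerFired = new CountDownLatch(1);

        // Stand-in for a factory with a fatal handler: this one only counts.
        ThreadFactory factory = r -> {
            Thread t = new Thread(r, "sketch-io");
            t.setDaemon(true);
            t.setUncaughtExceptionHandler((thread, error) -> {
                uncaught.incrementAndGet();
                handlerFired.countDown();
            });
            return t;
        };
        ExecutorService executor = Executors.newFixedThreadPool(1, factory);

        Runnable failing = () -> {
            throw new IllegalStateException("simulated RPC failure");
        };

        executor.execute(guarded(failing, swallowed)); // failure stays inside the task
        executor.execute(failing);                     // failure reaches the handler

        try {
            handlerFired.await(5, TimeUnit.SECONDS);
            executor.shutdown();
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new int[] {swallowed.get(), uncaught.get()};
    }

    public static void main(String[] args) {
        int[] result = runDemo();
        System.out.println("swallowed=" + result[0] + " uncaught=" + result[1]);
    }
}
```

Only the unguarded task reaches the handler. With a handler that exits the process, that second submission is the one that would bring down the JM, so the wrapper only makes sense for tasks whose failure is known to be harmless to the job.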

> notifyCheckpointAborted RPC failure can fail JM
> -----------------------------------------------
>
>                 Key: FLINK-20672
>                 URL: https://issues.apache.org/jira/browse/FLINK-20672
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.3, 1.12.0
>            Reporter: Roman Khachatryan
>            Assignee: Zakelly Lan
>            Priority: Not a Priority
>              Labels: auto-deprioritized-major, auto-deprioritized-minor, pull-request-available
>
> Introduced in FLINK-8871, aborted RPC notifications are done asynchronously:
>  
> {code}
>       private void sendAbortedMessages(long checkpointId, long timeStamp) {
>               // send notification of aborted checkpoints asynchronously.
>               executor.execute(() -> {
>                       // send the "abort checkpoint" messages to necessary vertices.
>                       // ..
>               });
>       }
> {code}
> However, the executor that eventually executes this request is created as follows:
> {code}
>               final ScheduledExecutorService futureExecutor = Executors.newScheduledThreadPool(
>                               Hardware.getNumberCPUCores(),
>                               new ExecutorThreadFactory("jobmanager-future"));
> {code}
> ExecutorThreadFactory uses an UncaughtExceptionHandler that exits the JVM on error.
> cc: [~yunta]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
