[ https://issues.apache.org/jira/browse/FLINK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376181#comment-17376181 ]
zlzhang0122 commented on FLINK-23189:
-------------------------------------

Sure, [~pnowojski]. I have posted an attachment recording the exception thrown in Flink 1.10.

CheckpointCoordinator#triggerCheckpoint() calls startTriggeringCheckpoint(), which in turn calls initializeCheckpoint(), and that method may throw an IOException (see [link|https://github.com/zlzhang0122/flink/blob/9e1cc0ac2bbf0a2e8fcf00e6730a10893d651590/flink-runtime/src/main/java/org/apache/flink/runtime/state/CheckpointStorageCoordinatorView.java#L83]). Currently that IOException produces a CheckpointFailureReason.TRIGGER_CHECKPOINT_FAILURE just like any other exception. I think such an IOException is caused by a disk error or some other IO problem that can hardly be recovered from, so maybe we should treat it more seriously and let users know about it faster, rather than just logging it.

> Count and fail the task when the disk is error on JobManager
> ------------------------------------------------------------
>
>                 Key: FLINK-23189
>                 URL: https://issues.apache.org/jira/browse/FLINK-23189
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.12.2, 1.13.1
>            Reporter: zlzhang0122
>            Priority: Major
>         Attachments: exception.txt
>
>
> When the JobManager disk has an error, triggerCheckpoint will throw an
> IOException and fail. This produces a TRIGGER_CHECKPOINT_FAILURE, but that
> failure will not fail the job, so users can hardly find the error unless
> they read the JobManager logs. To avoid this, I propose that we detect
> these IOException cases and increment the failureCounter, which can
> eventually fail the job.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
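The proposal above can be sketched as follows. This is a minimal, hypothetical illustration, not actual Flink code: the class `CheckpointFailureCounter`, its methods, and the threshold handling are all invented here to show the idea of counting IOExceptions from checkpoint triggering toward a failure counter that eventually fails the job.

```java
import java.io.IOException;

/**
 * Hypothetical sketch (not real Flink code): count IOExceptions thrown
 * while triggering a checkpoint, so that repeated disk/IO errors on the
 * JobManager eventually fail the job instead of only being logged.
 */
class CheckpointFailureCounter {
    private final int tolerableFailures;
    private int continuousFailures;

    CheckpointFailureCounter(int tolerableFailures) {
        this.tolerableFailures = tolerableFailures;
    }

    /** Called when triggering a checkpoint fails; returns true if the job should be failed. */
    boolean onTriggerFailure(Throwable cause) {
        if (cause instanceof IOException) {
            // Disk/IO errors can hardly be recovered from, so count them
            // toward the failure threshold instead of only logging them.
            continuousFailures++;
        }
        return continuousFailures > tolerableFailures;
    }

    /** A successful checkpoint resets the continuous-failure count. */
    void onCheckpointSuccess() {
        continuousFailures = 0;
    }
}
```

With a threshold of 2, the third consecutive IOException would signal that the job should fail; a successful checkpoint in between resets the counter. In real Flink, a mechanism along these lines exists via the tolerable checkpoint failure number, but whether a trigger-time IOException is counted there is exactly what this issue questions.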