pnowojski edited a comment on pull request #16905: URL: https://github.com/apache/flink/pull/16905#issuecomment-903619167
Thanks @gaoyunhaii for taking care of this issue. At this moment I can not do the full review, but on the high level I agree with your proposed solution. Regarding: > Do we need assert no exceptions ? > > I think perhaps we do not need to assert no exceptions here? If the task is still running when triggering, the trigger would be expected to succeed, but if the task is finished / canceled / failed, the trigger would be expected to fail. Currently we already have logic to deal with the second case that the checkpoint would be declined and in JM side a global failover would be triggered to keep consistent. Thus it seems to me we could remove the assertion ? otherwise the trigger failure would cause the TM process to exit, which might not be expected. We need to handle those exceptions in some way. They can not be ignored. We have learned a [hard way](https://issues.apache.org/jira/browse/FLINK-18137), that at the very least, if no exception should happen in a chained futures, such assertion should be added. Otherwise if there is a really unexpected error, like NPE or even a JVM bug, system should know about them, instead of for example deadlock. So one way or another, all exceptions should be handled. For example: - trigger task failover - decline checkpoint - log error/warning/debug message (that one would have to be carefully thought through, [but it has been done in other places](https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L604:L642)) - if none of the above makes sense, crashing JVM is the way to go. > If the task is still running when triggering, the trigger would be expected to succeed, but if the task is finished / canceled / failed, the trigger would be expected to fail In that case I would propose to add a logic to differentiate those two cases. `assertNoException` for the first one, `Environment#declineCheckpoint` or `StreamTask#declineCheckpoint` in the latter one. Or if it makes sense AND the task is left in a valid state, always `declineCheckpoint`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org