pnowojski commented on pull request #16905:
URL: https://github.com/apache/flink/pull/16905#issuecomment-903619167


   Thanks @gaoyunhaii  for taking care of this issue. At this moment I can not 
do the full review, but on the high level I agree with your proposed solution. 
Regarding:
   
   > Do we need assert no exceptions ?
   >
   > I think perhaps we do not need to assert no exceptions here? If the task 
is still running when triggering, the trigger would be expected to succeed, but 
if the task is finished / canceled / failed, the trigger would be expected to 
fail. Currently we already have logic to deal with the second case that the 
checkpoint would be declined and in JM side a global failover would be 
triggered to keep consistent. Thus it seems to me we could remove the assertion 
? otherwise the trigger failure would cause the TM process to exit, which might 
not be expected.
   
   We need to handle those exceptions in some way. They can not be ignored. We 
have learned a [hard way](https://issues.apache.org/jira/browse/FLINK-18137), 
that at the very least, if no exception should happen in a chained futures, 
such assertion should be added. Otherwise if there is a really unexpected 
error, like NPE or even a JVM bug, system should know about them, instead of 
for example deadlock. So one way or another, all exceptions should be handled. 
For example:
   - trigger task failover
   - decline checkpoint
   - log error/warning/debug message (that one would have to be carefully 
thought through, [but it has been done in other 
places](https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L604:L642))
   If none of the above makes sense, crashing JVM is the way to go.
   
   > If the task is still running when triggering, the trigger would be 
expected to succeed, but if the task is finished / canceled / failed, the 
trigger would be expected to fail
   
   In that case I would propose to add a logic to differentiate those two 
cases. `assertNoException` for the first one, `Environment#declineCheckpoint` 
or `StreamTask#declineCheckpoint` in the latter one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to