[GitHub] [flink] pnowojski edited a comment on pull request #16905: [FLINK-23884][checkpoint] Fix the concurrent problem between triggering savepoint with drain and task finishing

GitBox Mon, 23 Aug 2021 02:58:53 -0700


pnowojski edited a comment on pull request #16905:
URL: https://github.com/apache/flink/pull/16905#issuecomment-903619167

Thanks @gaoyunhaii for taking care of this issue. At this moment I can not
do the full review, but on the high level I agree with your proposed solution.
Regarding:

> Do we need assert no exceptions ?
>
> I think perhaps we do not need to assert no exceptions here? If the task
is still running when triggering, the trigger would be expected to succeed, but
if the task is finished / canceled / failed, the trigger would be expected to
fail. Currently we already have logic to deal with the second case that the
checkpoint would be declined and in JM side a global failover would be
triggered to keep consistent. Thus it seems to me we could remove the assertion
? otherwise the trigger failure would cause the TM process to exit, which might
not be expected.

We need to handle those exceptions in some way. They can not be ignored. We
have learned a [hard way](https://issues.apache.org/jira/browse/FLINK-18137),
that at the very least, if no exception should happen in a chained futures,
such assertion should be added. Otherwise if there is a really unexpected
error, like NPE or even a JVM bug, system should know about them, instead of
for example deadlock. So one way or another, all exceptions should be handled.
For example:
- trigger task failover
- decline checkpoint
- log error/warning/debug message (that one would have to be carefully
thought through, [but it has been done in other
places](https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L604:L642))
- if none of the above makes sense, crashing JVM is the way to go.

> If the task is still running when triggering, the trigger would be
expected to succeed, but if the task is finished / canceled / failed, the
trigger would be expected to fail

In that case I would propose to add a logic to differentiate those two
cases. `assertNoException` for the first one, `Environment#declineCheckpoint`
or `StreamTask#declineCheckpoint` in the latter one. Or if it makes sense AND
the task is left in a valid state, always `declineCheckpoint`.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [flink] pnowojski edited a comment on pull request #16905: [FLINK-23884][checkpoint] Fix the concurrent problem between triggering savepoint with drain and task finishing

Reply via email to