rkhachatryan commented on code in PR #21923: URL: https://github.com/apache/flink/pull/21923#discussion_r1124883714
##########
flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java:
##########

```diff
@@ -733,8 +764,7 @@ void restoreInternal() throws Exception {
         // needed
         channelIOExecutor.shutdown();

-        isRunning = true;
-        isRestoring = false;
+        taskState.status = TaskState.Status.RUNNING;
```

Review Comment:

> we don't need to keep that logic since if the job is canceled already we can easily ignore other exceptions which can happen since they can be actually because of cancelation

Please note that Task cancellation is not the same as Job cancellation. For example, in case of some intermittent failure, the Job will be `RESTARTING`, but the tasks will be cancelled and then rescheduled.

Ideally, we shouldn't be ignoring **any** exception during restart (just interruptions and cancellations). Otherwise, it might create a resource leak (job correctness won't be affected though).

As for the `running` flag (which will report `true` on `master`; and `false` when CAS to `CANCELLED`), I briefly skimmed through its usages:

- `Task.restoreAndInvoke` -> `StreamTask.invoke` -> `StreamTask.restoreInternal` — regression: it will allow a 2nd `restoreInternal`
- ongoing checkpoints will not be aborted by RPC — that's fine
- local state will not be pruned by the checkpoint-abort RPC — regression (but **likely** will be pruned on the next checkpoint)
- ...

The above can probably be fixed (in this PR), but my point is that it's hard to check all the combinations and their usages. If that's true, I'd stick closer to the original behavior by keeping the separate states **unless** we are sure that they are impossible.
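To illustrate the "2nd `restoreInternal`" regression the reviewer mentions: a plain boolean flag (or an unconditional assignment) does not reject a re-entrant restore, whereas a CAS-guarded transition does. Below is a minimal, hypothetical sketch (the class and method names are illustrative, not Flink's actual `TaskState`/`StreamTask` API) showing how an atomic status enum can make a second restore attempt fail fast:

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: a single status enum replacing separate
// isRunning/isRestoring booleans, with CAS-guarded transitions so that
// a second restore attempt is rejected, even if the task was
// concurrently moved to CANCELLED.
public class TaskStateMachine {
    enum Status { CREATED, RESTORING, RUNNING, CANCELLED }

    private final AtomicReference<Status> status =
            new AtomicReference<>(Status.CREATED);

    /** Returns true only for the first caller; re-entry and cancelled tasks fail the CAS. */
    public boolean tryStartRestore() {
        // The CAS fails if the status is already RESTORING, RUNNING, or
        // CANCELLED, closing the "2nd restoreInternal" hole.
        return status.compareAndSet(Status.CREATED, Status.RESTORING);
    }

    /** Completes restore; fails if the task was cancelled mid-restore. */
    public boolean finishRestore() {
        return status.compareAndSet(Status.RESTORING, Status.RUNNING);
    }

    /** Cancellation wins unconditionally. */
    public void cancel() {
        status.set(Status.CANCELLED);
    }

    public Status current() {
        return status.get();
    }
}
```

With this shape, `restoreInternal` would start by calling `tryStartRestore()` and bail out if it returns `false`, instead of relying on callers never to invoke it twice.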