zentol opened a new pull request, #19968: URL: https://github.com/apache/flink/pull/19968
Based on #19957. The `StopWithSavepoint` state of the adaptive scheduler has to account for a number of failure scenarios. One of these failure cases is a task failure with the savepoint going through. This for example happens when an exception is thrown in `notifyCheckpointComplete()`. In FLINK-26923 the code was changed to not trigger a restart in this case, and instead hard-fail the job, on the grounds that continuing a job despite a savepoint being created could lead to duplicate data. This change was done incorrectly, because it assumed the savepoint future to always complete first. There is however no such guarantee that this is actually the case. Primarily because the checkpoint coordinator does not run in the main thread it can happen that `any` failure happening around the time the CC completes the savepoint can end up calling into `StopWithSavepoint#onFailure`. This PR remedies this by forcing the handling of task failures or terminal states to wait for the savepoint to complete. In addition the PR refactors the state to be a little bit messy. Both the general and savepoint failure code paths handling were setting variables that influenced each other in certain scenarios, making the code exceptionally difficult to reason about. This now only happens in one direction, with the savepoint failure handling potentially being skipped if another failure happened. Finally the PR adds a lot of documentation for what failure scenarios exist and why things are done the way they are. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org