zentol opened a new pull request, #19968:
URL: https://github.com/apache/flink/pull/19968

   Based on #19957.
   
   The `StopWithSavepoint` state of the adaptive scheduler has to account for a 
number of failure scenarios.
   
   One of these failure cases is a task failure with the savepoint going 
through. This for example happens when an exception is thrown in 
`notifyCheckpointComplete()`. In FLINK-26923 the code was changed to not 
trigger a restart in this case, and instead hard-fail the job, on the grounds 
that continuing a job despite a savepoint being created could lead to duplicate 
data.
   This change was done incorrectly, because it assumed the savepoint future to 
always complete first. There is however no such guarantee that this is actually 
the case.
   Primarily because the checkpoint coordinator does not run in the main thread 
it can happen that `any` failure happening around the time the CC completes the 
savepoint can end up calling into `StopWithSavepoint#onFailure`.
   
   This PR remedies this by forcing the handling of task failures or terminal 
states to wait for the savepoint to complete.
   
   In addition the PR refactors the state to be a little bit messy. Both the 
general and savepoint failure code paths handling were setting variables that 
influenced each other in certain scenarios, making the code exceptionally 
difficult to reason about.
   This now only happens in one direction, with the savepoint failure handling 
potentially being skipped if another failure happened.
   
   Finally the PR adds a lot of documentation for what failure scenarios exist 
and why things are done the way they are.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to