zentol commented on code in PR #19968: URL: https://github.com/apache/flink/pull/19968#discussion_r901688321
########## flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StopWithSavepoint.java: ########## @@ -49,19 +49,48 @@ * of the operation) is made available via the "operationFuture" to the user. This operation is only * considered successfully if the "savepointFuture" completed successfully, and the job reached the * terminal state FINISHED. + * + * <p>This state has to cover several failure scenarios, depending on whether the savepoint + * succeeeds/fails and the job succeeds/fails/keeps running. + * + * <ul> + * <li>Savepoint succeeds, job succeeds - The happy path we like to see. + * <li>Savepoint fails, job fails - The generic failure case. Something happened during + * checkpointing on the TM side that also failed the task; fail the savepoint operation and + * restart the job. + * <li>Savepoint succeeds, job fails - Some issue occurred in notifyCheckpointComplete or during + * the job shutdown. Fail the savepoint operation and job, but inform the user about the + * created savepoint. + * <li>Savepoint fails, job keeps running - The savepoint failed due to an error on the JM side, + * before we ever triggered anything on the TM side. Fail the savepoint operation, but keep + * the job running. + * </ul> + * + * <p>This is further complicated by this information being transmitted via 2 separate RPCs from + * TM->JM, with the {@code savepointFuture} not being completed in the main thread, introducing + * ordering/lateness issues. Be careful to not liberally use {@link Context#runIfState(State, + * Runnable, Duration)} because it can result in a message being lost if multiple operations are + * queued and the first initiates a state transition. */ class StopWithSavepoint extends StateWithExecutionGraph { private final Context context; + /** + * The result future of this operation, containing the path to the savepoint. This is the future + * that other components (e.g., the REST API) wait for. + * + * <p>Must only be completed successfully if the savepoint was created and the job has FINISHED. + */ private final CompletableFuture<String> operationFuture; private final CheckpointScheduling checkpointScheduling; - private boolean hasFullyFinished = false; - - @Nullable private String savepoint = null; - @Nullable private Throwable operationFailureCause; + private boolean hasPendingStateTransition = false; + + // be careful when applying operations on this future that can trigger state transitions, + // as several other methods do the same and we mustn't trigger multiple transitions! + private final CompletableFuture<String> internalSavepointFuture = new CompletableFuture<>(); Review Comment: hmm if we'd move the expectFinished down a bit we'd have it covered... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org