zentol commented on code in PR #19968:
URL: https://github.com/apache/flink/pull/19968#discussion_r901688321


##########
flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StopWithSavepoint.java:
##########
@@ -49,19 +49,48 @@
  * of the operation) is made available via the "operationFuture" to the user. 
This operation is only
  * considered successfully if the "savepointFuture" completed successfully, 
and the job reached the
  * terminal state FINISHED.
+ *
+ * <p>This state has to cover several failure scenarios, depending on whether 
the savepoint
+ * succeeeds/fails and the job succeeds/fails/keeps running.
+ *
+ * <ul>
+ *   <li>Savepoint succeeds, job succeeds - The happy path we like to see.
+ *   <li>Savepoint fails, job fails - The generic failure case. Something 
happened during
+ *       checkpointing on the TM side that also failed the task; fail the 
savepoint operation and
+ *       restart the job.
+ *   <li>Savepoint succeeds, job fails - Some issue occurred in 
notifyCheckpointComplete or during
+ *       the job shutdown. Fail the savepoint operation and job, but inform 
the user about the
+ *       created savepoint.
+ *   <li>Savepoint fails, job keeps running - The savepoint failed due to an 
error on the JM side,
+ *       before we ever triggered anything on the TM side. Fail the savepoint 
operation, but keep
+ *       the job running.
+ * </ul>
+ *
+ * <p>This is further complicated by this information being transmitted via 2 
separate RPCs from
+ * TM->JM, with the {@code savepointFuture} not being completed in the main 
thread, introducing
+ * ordering/lateness issues. Be careful to not liberally use {@link 
Context#runIfState(State,
+ * Runnable, Duration)} because it can result in a message being lost if 
multiple operations are
+ * queued and the first initiates a state transition.
  */
 class StopWithSavepoint extends StateWithExecutionGraph {
 
     private final Context context;
+    /**
+     * The result future of this operation, containing the path to the 
savepoint. This is the future
+     * that other components (e.g., the REST API) wait for.
+     *
+     * <p>Must only be completed successfully if the savepoint was created and 
the job has FINISHED.
+     */
     private final CompletableFuture<String> operationFuture;
 
     private final CheckpointScheduling checkpointScheduling;
 
-    private boolean hasFullyFinished = false;
-
-    @Nullable private String savepoint = null;
-
     @Nullable private Throwable operationFailureCause;
+    private boolean hasPendingStateTransition = false;
+
+    // be careful when applying operations on this future that can trigger 
state transitions,
+    // as several other methods do the same and we mustn't trigger multiple 
transitions!
+    private final CompletableFuture<String> internalSavepointFuture = new 
CompletableFuture<>();

Review Comment:
   hmm if we'd move the expectFinished down a bit we'd have it covered...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to