zentol commented on a change in pull request #14963: URL: https://github.com/apache/flink/pull/14963#discussion_r579108209
##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/declarative/DeclarativeScheduler.java
##########
@@ -907,20 +909,37 @@ public void runIfState(State expectedState, Runnable action, Duration delay) {
     // ----------------------------------------------------------------
+    /** Note: Do not call this method from a State constructor. */
     @VisibleForTesting
-    void transitionToState(State newState) {
-        if (state != newState) {
-            LOG.debug(
-                    "Transition from state {} to {}.",
-                    state.getClass().getSimpleName(),
-                    newState.getClass().getSimpleName());
-
-            State oldState = state;
-            oldState.onLeave(newState.getClass());
-
-            state = newState;
-            newState.onEnter();
-        }
+    <S extends State> void transitionToState(StateFactory<S> targetState) {
+        Preconditions.checkState(
+                state != null, "State transitions are now allowed while construcing a state.");
+        Preconditions.checkState(
+                state.getClass() != targetState.getStateClass(),
+                "Attempted to transition into the very state the scheduler is already in.");
+
+        LOG.debug(
+                "Transition from state {} to {}.",
+                state.getClass().getSimpleName(),
+                targetState.getStateClass().getSimpleName());
+
+        State oldState = state;
+        oldState.onLeave(targetState.getStateClass());
+
+        // Guard against state transitions while constructing state objects.
+        //
+        // Consider the following scenario:
+        // Scheduler is in state Restarting, once the cancellation is complete, we enter the
+        // transitionToState(WaitingForResources) method.
+        // In the constructor of WaitingForResources, we call `notifyNewResourcesAvailable()`, which
+        // finds resources and enters transitionsToState(Executing). We are in state Executing. Then
+        // we return from the methods and go back in our call stack to the
+        // transitionToState(WaitingForResources) call, where we overwrite Executing with
+        // WaitingForResources. And there we have it, a deployed execution graph, and a scheduler
+        // that is in WaitingForResources.
+        state = null;

Review comment:
"The issues are fixed now; we don't need safeguards anymore" could just as well be used as an argument to keep the PR as is and even remove the state transition check. We fixed the one problematic case and could call it a day.

Overall, my impression is that we should not allow immediate state transitions in the constructor, `onEnter`, or `onLeave`, in any case, because all of these result in weird loops/interleavings of state transitions that can lead to subtle issues.

IOW, `transitionToState` should be an atomic operation that fully completes before another transition can be triggered. Any attempt at triggering another state transition in the meantime should fail hard.

Hence, whether `onEnter` exists or not is actually not relevant in this consideration. While it was the sole case where this issue occurred, the underlying issue is the unclear and unenforced contract as to what a State is allowed to do in which methods.

As you have shown in the PR, it is pretty easy to safeguard against such occurrences; you'd just need to null the state before calling `onLeave`.

Alternatively, this would also work, and would be a bit more consolidated:
```
State oldState = state;
state = null;
oldState.onLeave(targetState.getStateClass());

S newState = targetState.getState();
newState.onEnter();

// this will fail if any other state transition occurred in the meantime
Preconditions.checkState(state == null);
state = newState;
```

And as far as I'm concerned that's a pretty tiny cost compared to the risk of testing entirely theoretical scenarios or breaking the state machine.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
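
For illustration, the "atomic transition" contract described in the review comment can be sketched outside of Flink roughly as follows. This is a minimal sketch, not the PR's code: `GuardedStateMachine`, its nested `State` interface, and the use of `java.util.function.Supplier` in place of `StateFactory` are hypothetical simplifications, and plain `IllegalStateException`s stand in for `Preconditions.checkState`.

```java
import java.util.function.Supplier;

/** Hypothetical, simplified stand-in for the scheduler's state handling; not Flink code. */
final class GuardedStateMachine {

    /** Minimal state contract, loosely modelled on the onEnter/onLeave hooks discussed above. */
    public interface State {
        default void onEnter() {}

        default void onLeave() {}
    }

    // null means "a transition is currently in progress".
    private State state;

    public GuardedStateMachine(State initialState) {
        this.state = initialState;
    }

    public State getState() {
        return state;
    }

    /** Leaves the current state, creates the target state, enters it, then publishes it. */
    public void transitionTo(Supplier<? extends State> targetState) {
        if (state == null) {
            // A constructor, onEnter, or onLeave tried to trigger a nested transition.
            throw new IllegalStateException(
                    "State transitions are not allowed while another transition is in progress.");
        }

        State oldState = state;
        state = null; // guard: from here on, any nested transitionTo call fails hard
        oldState.onLeave();

        State newState = targetState.get(); // state construction happens under the guard
        newState.onEnter();

        if (state != null) {
            // Extra safety net: something assigned the field behind our back.
            throw new IllegalStateException("Another state transition occurred in the meantime.");
        }
        state = newState;
    }

    public static void main(String[] args) {
        State idle = new State() {};
        GuardedStateMachine machine = new GuardedStateMachine(idle);

        // A state that (incorrectly) triggers a transition from within onEnter.
        State misbehaving =
                new State() {
                    @Override
                    public void onEnter() {
                        machine.transitionTo(() -> idle);
                    }
                };

        try {
            machine.transitionTo(() -> misbehaving);
        } catch (IllegalStateException e) {
            System.out.println("Nested transition rejected: " + e.getMessage());
        }
    }
}
```

In this sketch a rejected transition leaves the field at `null`, so every later transition fails as well; whether the real scheduler should recover from that or treat it as fatal is a separate design decision that the sketch does not address.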