Izeren opened a new pull request, #27740:
URL: https://github.com/apache/flink/pull/27740

   # Reopen of #27719
   
   ## What is the purpose of the change
   
   Fix the flaky test class `ExecutionGraphRestartTest`.
   
   **Root cause:** The original test used 
`ComponentMainThreadExecutorServiceAdapter.forMainThread()`, which wraps a 
`DirectScheduledExecutorService` whose `execute(Runnable)` runs the task 
inline on the **calling** thread. When the `EXECUTOR_RESOURCE` thread ran 
deployment callbacks via `mainThreadExecutor.execute(callback)`, the callback 
therefore executed on the `EXECUTOR_RESOURCE` thread while the test thread was 
concurrently mutating `ExecutionGraph` state, causing a race condition.
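
The root cause above can be reproduced in isolation with plain JDK executors. The following is a minimal sketch, not code from this PR: `InlineExecutorDemo`, `DIRECT`, and `callbackThread` are hypothetical names, and `Runnable::run` only stands in for the inline behaviour described for `DirectScheduledExecutorService`. It reports which thread a callback really runs on when a worker thread hands it to the "main thread" executor.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class InlineExecutorDemo {

    // Assumption: stand-in for a direct executor. It runs the task on
    // whichever thread calls execute(), with no hand-off.
    static final Executor DIRECT = Runnable::run;

    /**
     * Hands a callback to mainThreadExecutor from a thread named "worker"
     * and returns the name of the thread the callback actually ran on.
     */
    static String callbackThread(Executor mainThreadExecutor) {
        ExecutorService worker =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "worker"));
        try {
            CompletableFuture<String> ranOn = new CompletableFuture<>();
            worker.execute(() -> mainThreadExecutor.execute(
                    () -> ranOn.complete(Thread.currentThread().getName())));
            return ranOn.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException("callback did not run", e);
        } finally {
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) {
        ExecutorService mainThread =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "main-thread"));
        try {
            // Direct executor: the callback stays on the submitting thread.
            System.out.println(callbackThread(DIRECT));     // prints "worker"
            // Dedicated executor: the callback is handed over to one fixed thread.
            System.out.println(callbackThread(mainThread)); // prints "main-thread"
        } finally {
            mainThread.shutdownNow();
        }
    }
}
```

With the direct executor, any state touched by the callback is touched from the worker thread, which is the situation the race description refers to. The dedicated single-thread executor funnels every callback onto the same thread regardless of who submitted it.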
   
   ## Brief change log
   
   - Use a dedicated single-thread executor instead of `forMainThread()` to 
serialize all `ExecutionGraph` state mutations on one thread
   - Use a `runInMainThread` helper that calls `.join()` so exceptions thrown 
on the main thread propagate to the test thread
   - Add `offerSlotsFromMainThread` / `tryOfferSlotsFromMainThread` methods to 
`SlotPoolUtils` for callers already on the main thread (avoids self-deadlock 
from re-entrant `CompletableFuture.runAsync().join()`)
   - Create slot pool with correct `mainThreadExecutor` via 
`DeclarativeSlotPoolBridgeBuilder.setMainThreadExecutor()`
   - Move slot pool lifecycle to `@BeforeEach` / `@AfterEach`
   - Extract `createSchedulerBuilder` helper to reduce per-test boilerplate
   - Split `testFailingExecutionAfterRestart` into two phases to account for 
async restart callback queuing
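
The helper pattern and the self-deadlock mentioned in the list above can be sketched with JDK classes only. This is an illustration under assumptions, not the PR's implementation: `MainThreadHelperDemo` and `runFromMainThread` are hypothetical names, and the re-entrancy guard is added here only to make the hazard visible. On a single-thread executor, a task that calls `CompletableFuture.runAsync(..., sameExecutor).join()` blocks the only thread while waiting for a task queued behind it, so callers already on that thread need an inline variant.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class MainThreadHelperDemo implements AutoCloseable {

    // Thread owned by the single-thread executor, recorded by the factory.
    private volatile Thread mainThread;

    private final ExecutorService mainThreadExecutor =
            Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "test-main-thread");
                t.setDaemon(true);
                mainThread = t;
                return t;
            });

    /** Runs the action on the main thread and waits; join() rethrows failures. */
    void runInMainThread(Runnable action) {
        if (Thread.currentThread() == mainThread) {
            // Joining here would wait for a task queued behind the current one.
            throw new IllegalStateException("already on the main thread");
        }
        CompletableFuture.runAsync(action, mainThreadExecutor).join();
    }

    /** Variant for callers already on the main thread: runs inline, no join. */
    void runFromMainThread(Runnable action) {
        if (Thread.currentThread() != mainThread) {
            throw new IllegalStateException("must be called from the main thread");
        }
        action.run();
    }

    @Override
    public void close() {
        mainThreadExecutor.shutdownNow();
    }

    static String demo() {
        try (MainThreadHelperDemo d = new MainThreadHelperDemo()) {
            StringBuilder log = new StringBuilder();
            d.runInMainThread(() -> {
                log.append("outer;");
                // Nested work uses the inline variant instead of re-entering join().
                d.runFromMainThread(() -> log.append("inner;"));
            });
            try {
                d.runInMainThread(() -> {
                    throw new IllegalArgumentException("boom");
                });
            } catch (CompletionException e) {
                // The failure from the main thread surfaces on the calling thread.
                log.append("caught:").append(e.getCause().getMessage());
            }
            return log.toString();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "outer;inner;caught:boom"
    }
}
```

Without `.join()`, the `IllegalArgumentException` would stay inside the future and the calling test would pass silently. With it, the failure reaches the caller wrapped in a `CompletionException`.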
   
   ## Verifying this change
   
   This change was verified by running `@RepeatedTest(300)` on all 7 test 
methods (2100 total executions) with 0 failures.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): **no**
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: **no**
     - The serializers: **no**
     - The runtime per-record code paths (performance sensitive): **no**
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: **no**
     - The S3 file system connector: **no**
   
   ## Documentation
   
     - Does this pull request introduce a new feature? **no**
     - If yes, how is the feature documented? **not applicable**

