[
https://issues.apache.org/jira/browse/FLINK-38536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18086727#comment-18086727
]
Martijn Visser commented on FLINK-38536:
----------------------------------------
>From my AI analyzer:
The failures are a test-harness data race, not a production bug.
FinalizeOnMasterTest (and its sibling ExecutionGraphFinishTest) wired the
scheduler with ComponentMainThreadExecutorServiceAdapter.forMainThread() as the
JobManager main-thread executor while passing a separate single-threaded I/O
executor. forMainThread() is backed by DirectScheduledExecutorService, whose
execute() runs the command inline on the calling thread rather than on one
dedicated main thread.
Execution#deploy() builds the TaskDeploymentDescriptor on the I/O executor via
CompletableFuture.supplyAsync(..., ioExecutor), then composes the continuation
back to the main thread with thenComposeAsync(..., mainThreadExecutor). Because
forMainThread() executes inline, those continuations — TDD creation, task
submission, and the deployment-completion handling that can call markFailed —
ran on the I/O thread, concurrently with the test thread still inside
startScheduling() (and, in ExecutionGraphFinishTest, later mutating state via
markFinished()). The unsynchronized concurrent mutation of the execution graph
produced both observed signatures:
- IllegalStateException: BUG: trying to schedule a region which is not in
CREATED state (region vertices mutated mid-scheduling), and
- expected: RUNNING but was: FAILING (a background deployment callback failed
an execution and triggered failover).
The async I/O-executor deploy path was introduced by FLINK-38114 (asynchronous
offloading of TaskRestore), which is why these tests started flaking now.
Production scheduling code is unchanged.
Fix
Run all execution-graph interactions on a dedicated single-threaded JobManager
main-thread executor (forSingleThreadExecutor over
TestingUtils.jmMainThreadExecutorExtension()), keeping the separate I/O
executor, so the deployment callbacks are serialized with the test logic
instead of racing on the I/O thread. No assertion or functional test logic
changed. The fix mirrors the pattern already used by ExecutionGraphSuspendTest
and SchedulerTestingUtils. The same remedy is applied preemptively to
ExecutionGraphFinishTest, which shares the wiring and code path.
> FinalizeOnMasterTest failed in test_cron_hadoop313 core
> -------------------------------------------------------
>
> Key: FLINK-38536
> URL: https://issues.apache.org/jira/browse/FLINK-38536
> Project: Flink
> Issue Type: Bug
> Components: Tests
> Affects Versions: 2.2.0
> Reporter: Ruan Hang
> Priority: Major
> Labels: pull-request-available
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=70334&view=logs&j=d89de3df-4600-5585-dadc-9bbc9a5e661c&t=d0a8f428-a6bd-5382-3317-923776675ef0
> {noformat}
> Oct 19 01:45:54 01:45:54.121 [ERROR] Tests run: 2, Failures: 1, Errors: 0,
> Skipped: 0, Time elapsed: 0.440 s <<< FAILURE! -- in
> org.apache.flink.runtime.executiongraph.FinalizeOnMasterTest
> Oct 19 01:45:54 01:45:54.121 [ERROR]
> org.apache.flink.runtime.executiongraph.FinalizeOnMasterTest.testFinalizeIsCalledUponSuccess
> -- Time elapsed: 0.038 s <<< FAILURE!
> Oct 19 01:45:54 org.opentest4j.AssertionFailedError:
> Oct 19 01:45:54
> Oct 19 01:45:54 expected: RUNNING
> Oct 19 01:45:54 but was: FAILING
> Oct 19 01:45:54 at
> org.apache.flink.runtime.executiongraph.FinalizeOnMasterTest.testFinalizeIsCalledUponSuccess(FinalizeOnMasterTest.java:72)
> Oct 19 01:45:54 at
> java.base/java.lang.reflect.Method.invoke(Method.java:568)
> Oct 19 01:45:54 at
> java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
> Oct 19 01:45:54 at
> java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
> Oct 19 01:45:54 at
> java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
> Oct 19 01:45:54 at
> java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
> Oct 19 01:45:54 at
> java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
> {noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)