MartijnVisser opened a new pull request, #28571:
URL: https://github.com/apache/flink/pull/28571

   ## What is the purpose of the change
   
   `RescaleTimelineITCase` is flaky on CI in two independent ways, fixed here 
as two commits:
   
   - **FLINK-40009**: `waitUntilConditionWithTimeout` never enforces its 
timeout when the test runs on a ForkJoinPool worker. The test can hang until 
the CI watchdog kills the fork; a thread dump in build 76242 showed it still 
parked 978s into a 20s budget.
   - **FLINK-40010**: 
`testRescaleTerminatedByNoResourcesOrNoParallelismsChange` can time out because 
the awaited terminal reason `NO_RESOURCES_OR_PARALLELISMS_CHANGE` is never 
recorded (builds 76242, 76350).
   
   ## Brief change log
   
   - **FLINK-40009**: `waitUntilConditionWithTimeout` wrapped an unbounded 
`CommonTestUtils#waitUntilCondition` in `CompletableFuture#runAsync` and 
bounded it with `get(timeout)`. On a ForkJoinPool worker, 
`CompletableFuture#timedGet` help-executes the async task inline, so the 
never-ending poll loop runs on the waiting thread and the timeout never fires. 
Replaced with a synchronous poll on the calling thread against a `Deadline`.
   - **FLINK-40010**: `NO_RESOURCES_OR_PARALLELISMS_CHANGE` is stamped only on 
the rescale tracked when the manager re-enters its Idling phase. With the short 
shared cooldown, the cooldown can elapse and Idling be reached before the 
requirements-update RPC is processed, so the `UPDATE_REQUIREMENT` rescale never 
receives the terminal reason. The fixture is rebuilt with a 10s cooldown (with 
the wait budget widened to 60s) so the update is processed in Cooldown and 
routed back through Idling. The shared cluster-rebuild logic is extracted into 
a helper reused by the sibling 
`testRescaleTerminatedByResourceRequirementsUpdated`.
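
For reviewers who want the FLINK-40009 idea in isolation, here is a minimal sketch of a deadline-bounded poll that runs on the calling thread. It is not the code in this PR: the class name, the method signature, and the `timesOutOnForkJoinWorker` probe are hypothetical, and a plain `System.nanoTime()` deadline stands in for Flink's `Deadline` so the snippet runs without Flink on the classpath.

```java
import java.time.Duration;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

// Hypothetical illustration class; not part of Flink.
final class DeadlinePoll {

    /** Polls on the calling thread until the condition holds or the budget is spent. */
    static void waitUntilConditionWithTimeout(
            BooleanSupplier condition, Duration timeout, Duration retryInterval)
            throws InterruptedException, TimeoutException {
        // Assumption: a nanoTime-based deadline replaces Flink's Deadline here.
        final long deadlineNanos = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            long remainingNanos = deadlineNanos - System.nanoTime();
            if (remainingNanos <= 0) {
                throw new TimeoutException("Condition not met within " + timeout);
            }
            // Never sleep past the deadline; sleep at least 1 ms to avoid spinning.
            long remainingMillis = Math.max(1, remainingNanos / 1_000_000);
            Thread.sleep(Math.min(retryInterval.toMillis(), remainingMillis));
        }
    }

    /** Runs the helper on a ForkJoinPool worker and reports whether it timed out. */
    static boolean timesOutOnForkJoinWorker() throws Exception {
        ForkJoinPool pool = new ForkJoinPool(1);
        try {
            return pool.submit(() -> {
                try {
                    // The condition never becomes true, so only the deadline can end the wait.
                    waitUntilConditionWithTimeout(
                            () -> false, Duration.ofMillis(200), Duration.ofMillis(20));
                    return false;
                } catch (TimeoutException expected) {
                    return true;
                }
            }).get();
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("timed out on ForkJoinPool worker: " + timesOutOnForkJoinWorker());
    }
}
```

Because the loop checks the deadline itself and never blocks on a future, there is no async task for the waiting thread to help-execute, so the behaviour should not depend on which pool the caller runs on.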
   
   ## Verifying this change
   
   This change is already covered by existing tests in `RescaleTimelineITCase`. 
Verified locally: the previously failing method now passes repeatedly; the full 
`RescaleTimelineITCase` is green (30 tests run, 8 skipped, 0 failures); a probe 
confirmed the rewritten wait helper throws `TimeoutException` when invoked on a 
ForkJoinPool worker (the old code hung there).
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable
   
   ---
   
   ##### Was generative AI tooling used to co-author this PR?
   
   - [X] Yes (Claude Opus 4.8 (1M context))
   
   Generated-by: Claude Opus 4.8 (1M context)
   

