MartijnVisser opened a new pull request, #28571:
URL: https://github.com/apache/flink/pull/28571
## What is the purpose of the change
`RescaleTimelineITCase` is flaky on CI in two independent ways. This PR fixes
each in its own commit:
- **FLINK-40009**: `waitUntilConditionWithTimeout` never enforces its
timeout when the test runs on a ForkJoinPool worker, so the test can hang until
the CI watchdog kills the fork. A thread dump from build 76242 showed it still
parked after 978s against a 20s budget.
- **FLINK-40010**:
`testRescaleTerminatedByNoResourcesOrNoParallelismsChange` can time out because
the awaited terminal reason `NO_RESOURCES_OR_PARALLELISMS_CHANGE` is never
recorded (builds 76242, 76350).
## Brief change log
- **FLINK-40009**: The old `waitUntilConditionWithTimeout` wrapped an unbounded
`CommonTestUtils#waitUntilCondition` in `CompletableFuture#runAsync` and
bounded it with `get(timeout)`. On a ForkJoinPool worker,
`CompletableFuture#timedGet` help-executes the async task inline, so the
never-ending poll loop runs on the waiting thread and the timeout never fires.
The helper now polls synchronously on the calling thread against a `Deadline`.
- **FLINK-40010**: `NO_RESOURCES_OR_PARALLELISMS_CHANGE` is stamped only on
the rescale tracked when the manager re-enters its Idling phase. With the short
shared cooldown, the cooldown can elapse and Idling be reached before the
requirements-update RPC is processed, so the `UPDATE_REQUIREMENT` rescale never
receives the terminal reason. The fixture is rebuilt with a 10s cooldown (with
the wait budget widened to 60s) so the update is processed in Cooldown and
routed back through Idling. The shared cluster-rebuild logic is extracted into
a helper reused by the sibling
`testRescaleTerminatedByResourceRequirementsUpdated`.
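
For reviewers who want the shape of the FLINK-40009 fix without opening the
diff, here is a minimal sketch of a synchronous, deadline-bounded poll. It is
not the code from this PR: the class name `DeadlinePoll` and the `pollInterval`
parameter are hypothetical, and it uses `System.nanoTime()` in place of Flink's
`Deadline` so that it compiles with the JDK alone.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for the rewritten test helper; not the PR's actual code. */
final class DeadlinePoll {

    private DeadlinePoll() {}

    /**
     * Polls {@code condition} on the calling thread until it returns true or
     * {@code timeout} elapses. No async task is involved, so there is nothing
     * for a ForkJoinPool worker to help-execute inline.
     */
    static void waitUntilConditionWithTimeout(
            BooleanSupplier condition, Duration timeout, Duration pollInterval)
            throws InterruptedException, TimeoutException {
        // nanoTime() is monotonic, so wall-clock adjustments cannot skew the deadline.
        final long deadlineNanos = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            final long remainingNanos = deadlineNanos - System.nanoTime();
            if (remainingNanos <= 0) {
                throw new TimeoutException("Condition not met within " + timeout);
            }
            // Never sleep past the deadline.
            final long sleepNanos = Math.min(pollInterval.toNanos(), remainingNanos);
            Thread.sleep(sleepNanos / 1_000_000L, (int) (sleepNanos % 1_000_000L));
        }
    }
}
```

Because the deadline check sits in the same loop as the condition check, the
timeout holds on any thread, provided the condition itself does not block.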
## Verifying this change
This change is already covered by existing tests in `RescaleTimelineITCase`.
Verified locally: the failing method passes repeatedly; the full
`RescaleTimelineITCase` is green (30 run, 8 skipped, 0 failures); a probe
confirmed the rewritten wait helper throws `TimeoutException` when invoked on a
ForkJoinPool worker (the old code hung there).
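
A probe along these lines could reproduce the check described above. This is a
sketch under assumptions, not the probe that was actually run:
`ForkJoinTimeoutProbe`, `pollUntil`, and `timesOutOnForkJoinWorker` are
hypothetical names, and `pollUntil` is a simplified stand-in for the rewritten
helper.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinWorkerThread;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical probe; names and structure are illustrative only. */
final class ForkJoinTimeoutProbe {

    private ForkJoinTimeoutProbe() {}

    /** Simplified stand-in for the rewritten helper: polls on the calling thread. */
    static void pollUntil(BooleanSupplier condition, long timeoutMillis)
            throws InterruptedException, TimeoutException {
        final long deadlineNanos =
                System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadlineNanos >= 0) {
                throw new TimeoutException(
                        "Condition not met within " + timeoutMillis + " ms");
            }
            Thread.sleep(5);
        }
    }

    /** Runs the poll on a ForkJoinPool worker and reports whether it timed out there. */
    static boolean timesOutOnForkJoinWorker(long timeoutMillis) throws Exception {
        final ForkJoinPool pool = new ForkJoinPool(1);
        try {
            return pool.submit(() -> {
                // Guard: the probe is only meaningful on a ForkJoinPool worker thread.
                if (!(Thread.currentThread() instanceof ForkJoinWorkerThread)) {
                    throw new IllegalStateException("not on a ForkJoinPool worker");
                }
                try {
                    pollUntil(() -> false, timeoutMillis); // condition never holds
                    return false;
                } catch (TimeoutException expected) {
                    return true;
                }
            }).get(30, TimeUnit.SECONDS); // outer bound, awaited from the caller's thread
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The outer `get(30, TimeUnit.SECONDS)` is awaited from the submitting thread
rather than from a pool worker, so it acts as a safety net for the probe itself.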
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
- The S3 file system connector: no
## Documentation
- Does this pull request introduce a new feature? no
- If yes, how is the feature documented? not applicable
---
##### Was generative AI tooling used to co-author this PR?
- [X] Yes (Claude Opus 4.8 (1M context))
Generated-by: Claude Opus 4.8 (1M context)