Dennis-Mircea opened a new pull request, #1077:
URL: https://github.com/apache/flink-kubernetes-operator/pull/1077

   
   
   <!--
   *Thank you very much for contributing to the Apache Flink Kubernetes 
Operator - we are happy that you want to help us improve the project. To help 
the community review your contribution in the best possible way, please go 
through the checklist below, which will get the contribution into a shape in 
which it can be best reviewed.*
   
   ## Contribution Checklist
   
     - Make sure that the pull request corresponds to a [JIRA 
issue](https://issues.apache.org/jira/projects/FLINK/issues). Exceptions are 
made for typos in JavaDoc or documentation files, which need no JIRA issue.
     
     - Name the pull request in the form "[FLINK-XXXX] [component] Title of the 
pull request", where *FLINK-XXXX* should be replaced by the actual issue 
number. Skip *component* if you are unsure about which is the best component.
     Typo fixes that have no associated JIRA issue should be named following 
this pattern: `[hotfix][docs] Fix typo in event time introduction` or 
`[hotfix][javadocs] Expand JavaDoc for PuncuatedWatermarkGenerator`.
   
     - Fill out the template below to describe the changes contributed by the 
pull request. That will give reviewers the context they need to do the review.
     
     - Make sure that the change passes the automated tests, i.e., `mvn clean 
verify` passes. You can read more on how we use GitHub Actions for CI 
[here](https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/development/guide/#cicd).
   
     - Each pull request should address only one issue, not mix up code from 
multiple issues.
     
     - Each commit in the pull request has a meaningful commit message 
(including the JIRA id)
   
     - Once all items of the checklist are addressed, remove the above text and 
this checklist, leaving only the filled out template below.
   
   
   **(The sections below can be removed for hotfixes of typos)**
   -->
   
   ## What is the purpose of the change
   
   The parallelism alignment logic in `JobVertexScaler#scale` can **cancel**, 
**invert**, or **overshoot** scaling decisions for vertices that require 
key-group or source-partition alignment (sources or vertices with HASH upstream 
shuffle). The root cause is a single, non-directional code path that allows the 
alignment search to return values on the wrong side of `currentParallelism`.
   
   Additionally, the existing two modes (`EVENLY_SPREAD` and 
`MAXIMIZE_UTILISATION`) lack sufficient granularity, `EVENLY_SPREAD` silently 
runs an upward outside-range search *and* a downward relaxed fallback, making 
it impossible to configure a truly strict divisor-only mode. There is also no 
mechanism to compose modes (e.g., strict alignment first, then fall back to a 
relaxed strategy).
   
   JIRA: https://issues.apache.org/jira/browse/FLINK-39299
   
   
   ## Brief change log
   
   - **Direction-aware `scaleUp()` / `scaleDown()` entry points:** Split the 
single alignment path into two methods with structural direction guarantees. 
`scaleUp()` ensures results stay strictly above `currentParallelism`; 
`scaleDown()` ensures results stay strictly below.
   
   - **Three-phase alignment algorithm per mode:** Phase 1 (within-range 
divisor search), Phase 2 (upward outside-range search), Phase 3 (relaxed 
downward fallback). Each mode activates a subset of phases via 
`searchesWithinRange()` / `allowsOutsideRange()` predicates. Phase 2+3 are 
extracted into a shared `applyOutsideRangeSearch()` method to eliminate 
duplication between scale-up and scale-down.
   
   - **New `ADAPTIVE_UPWARD_SPREAD` mode (rename + new default):** Captures the 
old `EVENLY_SPREAD` behavior (combined divisor search + relaxed fallback + 
downward snap) under an accurate name. Now the default mode, preserving 
backward compatibility.
   
   - **Redefined `EVENLY_SPREAD`:** Now a truly strict within-range 
divisor-only mode. Blocks scaling if no exact divisor of N exists between 
`currentParallelism` and `newParallelism`.
   
   - **New `OPTIMIZE_RESOURCE_UTILIZATION` mode:** Minimal-strictness mode that 
accepts any `p ≤ N` (every subtask has work). Uses the autoscaler's computed 
target almost as-is.
   
   - **Composable fallback configuration (`mode.fallback`):** New config option 
`job.autoscaler.scaling.key-group.partitions.adjust.mode.fallback` with enum 
`KeyGroupOrPartitionsAdjustFallback` (`DEFAULT`, `NONE`, 
`ADAPTIVE_UPWARD_SPREAD`, `MAXIMIZE_UTILISATION`, 
`OPTIMIZE_RESOURCE_UTILIZATION`). Decouples preferred alignment from 
degradation strategy - any mode can be paired with any fallback. Sequential 
execution: primary mode's full algorithm runs first; only if it returns the 
sentinel does the fallback run as a second pass via `toMode()` mapping.
   
   - **Simplified `isAligned()`:** Removed dead `outsideRange` parameter and 
`newParallelism` argument since `isAligned()` is now only used for within-range 
(Phase 1) checks. Outside-range alignment is handled inline in 
`applyOutsideRangeSearch()`.
   
   - **Compliance-aware `WARN` logging:** When both mode and fallback fail, 
distinguishes whether `currentParallelism` is itself aligned (no better value 
in the direction) vs returned as a last resort (not aligned, purely preserving 
direction).
   
   - **Minor test/code fixes:** Replaced deprecated `conf.setInteger()` with 
`conf.set()` in tests; added generics to raw types in `JobAutoScalerImplTest`; 
updated one test expectation that was previously masking a direction-safety bug 
(scale-down from 22 with N=35 and lower limit 20 now correctly stays at 22 
instead of incorrectly dropping to 20).
   
   
   ## Verifying this change
   - All 22 existing `JobVertexScalerTest` tests pass, confirming backward 
compatibility of the `ADAPTIVE_UPWARD_SPREAD` default.
   - One test expectation updated: `cur=22, sf=1.1, N=35, lower=20` now expects 
22 (not 20), correctly reflecting that the alignment should not invert a 
scale-up into a scale-down below `currentParallelism`.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changes to the `CustomResourceDescriptors`: 
no
     - Core observer or reconciler logic that is regularly executed: no
   
   ## Documentation
   
    - Does this pull request introduce a new feature? Yes, new 
`OPTIMIZE_RESOURCE_UTILIZATION` mode, redefined `EVENLY_SPREAD` semantics, 
`ADAPTIVE_UPWARD_SPREAD` default, and composable `mode.fallback` config option.
   - If yes, how is the feature documented? To be documented in a follow-up 
(configuration reference update for the two new config option values and the 
fallback config).
   
   Note: This PR focuses on the implementation of the direction-safe alignment 
architecture, new modes, and the composable fallback framework. Comprehensive 
unit tests covering the new modes, fallback combinations, and direction-safety 
edge cases will be added in a follow-up once there is agreement on the overall 
approach.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to