rionmonster opened a new pull request, #27238: URL: https://github.com/apache/flink/pull/27238
## What is the purpose of the change This pull request addresses the issue detailed in [FLINK-38349](https://issues.apache.org/jira/browse/FLINK-38349) which describes scenarios in which certain edge-case values for `maxFanIn` and `channelIDs.size()` could result in floating-point errors that resulted in thrown `ArithmeticException` exceptions (specifically division by zero). This adds the appropriate safeguards to prevent these issues. The originating JIRA ticket provides detailed examples illustrating how the floating-point errors were reproduced. In summary, **the issue occurs whenever the channel count (`channelIDs.size()`) is greater than or equal to a power of `maxFanIn`:** ``` channelIDs.size() ≥ maxFanInⁿ ``` The overall change not only addresses the floating-point issues but also **improves scaling calculation performance by 4-5x** (see chart below). ## Brief change log - Updated the scaling logic that occurs within `SpillingThread.mergeChannelList()` by replacing the previous `Math.log()`-based calculations, which could result in floating point errors, in favor of an iterative, exponential search. - Added a private `computeMergeScaleForChannelSize()` function to encapsulate and document the updated scaling calculations. - Added a new `SpillingThreadTest` file with an associated, parameterized `testMergeChannelListDivideByZeroSafety` test that was used to reproduce the original issue and was later adjusted to confirm the fix. ## Verifying this change This change added tests and can be verified as follows: - Added new `SpillingThreadTest.testMergeChannelListDivideByZeroSafety` which originally reproduced the issue by creating a series of parameterized tests founds within the originating JIRA ticket. - Ensured all previous test suites within the related areas were passing as expected. In addition to these tests related to reproduction, I performed a series of manual tests to verify performance between the previous `Math.log()`-based approach (prone to floating-point error) and the newer iterative, exponential approach. Iterations | Old Approach (ns/call) | New Approach (ns/call) | Improvement -- | -- | -- | -- 1,000 | 33.07 | 17.68 | 1.87x 10,000 | 28.62 | 8.6 | 3.33x 100,000 | 8.94 | 4.42 | 2.02x 1,000,000 | 6.2 | 1.78 | 3.48x 100,000,000 | 5.66 | 1.28 | 4.42x You can [see the full gist here](https://gist.github.com/rionmonster/38d1e99ee0078ae1d2659986bbef8534) with all of the necessary code comparing the older and newer approaches (elected not to commit as it was purely for demonstration purposes). ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): **no** - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: **no** - The serializers: **no** - The runtime per-record code paths (performance sensitive): **yes / don't know** - Anything that affects deployment or recovery: **no** - The S3 file system connector: **no** ## Documentation - Does this pull request introduce a new feature? **no** - If yes, how is the feature documented? **not applicable** -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
