[PR] [FLINK-38349][runtime] Address SpillingThread Zero-Division During Channel Merging Operations [flink]

via GitHub Thu, 13 Nov 2025 12:43:36 -0800


rionmonster opened a new pull request, #27238:
URL: https://github.com/apache/flink/pull/27238


   ## What is the purpose of the change
   
   This pull request addresses the issue detailed in 
[FLINK-38349](https://issues.apache.org/jira/browse/FLINK-38349) which 
describes scenarios in which certain edge-case values for `maxFanIn` and 
`channelIDs.size()` could result in floating-point errors that resulted in 
thrown `ArithmeticException` exceptions (specifically division by zero). This 
adds the appropriate safeguards to prevent these issues.
   
   The originating JIRA ticket provides detailed examples illustrating how the 
floating-point errors were reproduced. In summary, **the issue occurs whenever 
the channel count (`channelIDs.size()`) is greater than or equal to a power of 
`maxFanIn`:**
   ```
   channelIDs.size() ≥ maxFanInⁿ
   ```
   
   The overall change not only addresses the floating-point issues but also 
**improves scaling calculation performance by 4-5x** (see chart below).
   
   ## Brief change log
   - Updated the scaling logic that occurs within 
`SpillingThread.mergeChannelList()` by replacing the previous 
`Math.log()`-based calculations, which could result in floating point errors, 
in favor of an iterative, exponential search.
     - Added a private `computeMergeScaleForChannelSize()` function to 
encapsulate and document the updated scaling calculations.
   - Added a new `SpillingThreadTest` file with an associated, parameterized 
`testMergeChannelListDivideByZeroSafety` test that was used to reproduce the 
original issue and was later adjusted to confirm the fix.
   
   ## Verifying this change
   
   This change added tests and can be verified as follows:
   
     - Added new `SpillingThreadTest.testMergeChannelListDivideByZeroSafety` 
which originally reproduced the issue by creating a series of parameterized 
tests founds within the originating JIRA ticket.
     - Ensured all previous test suites within the related areas were passing 
as expected.
     
   In addition to these tests related to reproduction, I performed a series of 
manual tests to verify performance between the previous `Math.log()`-based 
approach (prone to floating-point error) and the newer iterative, exponential 
approach.
   
   Iterations | Old Approach (ns/call) | New Approach (ns/call) | Improvement
   -- | -- | -- | --
   1,000 | 33.07 | 17.68 | 1.87x
   10,000 | 28.62 | 8.6 | 3.33x
   100,000 | 8.94 | 4.42 | 2.02x
   1,000,000 | 6.2 | 1.78 | 3.48x
   100,000,000 | 5.66 | 1.28 | 4.42x
   
   You can [see the full gist 
here](https://gist.github.com/rionmonster/38d1e99ee0078ae1d2659986bbef8534) 
with all of the necessary code comparing the older and newer approaches 
(elected not to commit as it was purely for demonstration purposes).
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): **no**
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: **no**
     - The serializers: **no**
     - The runtime per-record code paths (performance sensitive): **yes / don't 
know**
     - Anything that affects deployment or recovery: **no**
     - The S3 file system connector: **no**
   
   ## Documentation
   
     - Does this pull request introduce a new feature? **no**
     - If yes, how is the feature documented? **not applicable**


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[PR] [FLINK-38349][runtime] Address SpillingThread Zero-Division During Channel Merging Operations [flink]

Reply via email to