Weihua Hu created FLINK-32201:
---------------------------------

             Summary: Enable the distribution of shuffle descriptors via the 
blob server by connection number
                 Key: FLINK-32201
                 URL: https://issues.apache.org/jira/browse/FLINK-32201
             Project: Flink
          Issue Type: Sub-task
          Components: Runtime / Coordination
            Reporter: Weihua Hu



Flink support distributes shuffle descriptors via the blob server to reduce 
JobManager overhead. But the default threshold to enable it is 1MB, which never 
reaches. Users need to set a proper value for this, but it requires advanced 
knowledge before configuring it.

I would like to enable this feature by the number of connections of a group of 
shuffle descriptors. For examples, a simple streaming job with two operators, 
each with 10,000 parallelism and connected via all-to-all distribution. In this 
job, we only get one set of shuffle descriptors, and this group has 10000 * 
10000 connections. This means that JobManager needs to send this set of shuffle 
descriptors to 10000 tasks.

Since it is also difficult for users to configure, I would like to give it a 
default value. The serialized shuffle descriptors sizes for different 
parallelism are shown below.


|| Producer parallelism || serialized shuffle descriptor size || consumer 
parallelism || total data size that JM needs to send ||
| 5000 | 100KB | 5000 | 500MB |
| 10000 | 200KB | 10000 | 2GB |
| 20000 | 400Kb | 20000 | 8GB |

So, I would like to set the default value to 10,000 * 10,000. 

Any suggestions or concerns are appreciated.





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to