Weihua Hu created FLINK-32201:
---------------------------------

Summary: Enable the distribution of shuffle descriptors via the blob server by connection number
Key: FLINK-32201
URL: https://issues.apache.org/jira/browse/FLINK-32201
Project: Flink
Issue Type: Sub-task
Components: Runtime / Coordination
Reporter: Weihua Hu

Flink supports distributing shuffle descriptors via the blob server to reduce JobManager overhead. However, the default size threshold for enabling it is 1MB, which is never reached in practice. Users need to set a proper value themselves, which requires advanced knowledge.

I would like to enable this feature based on the number of connections in a group of shuffle descriptors.

For example, consider a simple streaming job with two operators, each with a parallelism of 10,000, connected via all-to-all distribution. In this job there is only one set of shuffle descriptors, and this group has 10,000 * 10,000 connections. This means that the JobManager needs to send this set of shuffle descriptors to 10,000 tasks.

Since this threshold is also difficult for users to configure, I would like to give it a default value. The serialized shuffle descriptor sizes for different parallelisms are shown below.

|| Producer parallelism || Serialized shuffle descriptor size || Consumer parallelism || Total data size the JM needs to send ||
| 5000 | 100KB | 5000 | 500MB |
| 10000 | 200KB | 10000 | 2GB |
| 20000 | 400KB | 20000 | 8GB |

So, I would like to set the default value to 10,000 * 10,000.

Any suggestions or concerns are appreciated.
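To make the arithmetic behind the table concrete, here is a rough sketch in Java. It is not Flink code: the class, method, and constant names are hypothetical, and it assumes that each consumer task receives its own copy of the serialized descriptors, that 1MB = 1000KB (as the table's figures imply), and that the threshold comparison is inclusive, which the proposal does not specify.

```java
// Hypothetical helper, not part of Flink: reproduces the estimates in the table above.
final class ShuffleDescriptorEstimate {

    // Default proposed in this issue: 10,000 * 10,000 connections.
    static final long PROPOSED_CONNECTION_THRESHOLD = 10_000L * 10_000L;

    // In an all-to-all group, every producer connects to every consumer.
    static long connections(int producerParallelism, int consumerParallelism) {
        return (long) producerParallelism * consumerParallelism;
    }

    // Total KB the JobManager sends if every consumer task gets its own copy.
    static long totalKbSent(long serializedSizeKb, int consumerParallelism) {
        return serializedSizeKb * consumerParallelism;
    }

    // Assumption: offload when the connection count reaches the threshold (inclusive).
    static boolean offloadToBlobServer(int producers, int consumers, long threshold) {
        return connections(producers, consumers) >= threshold;
    }

    public static void main(String[] args) {
        int[] parallelism = {5_000, 10_000, 20_000};
        long[] serializedSizeKb = {100, 200, 400}; // sizes taken from the table

        for (int i = 0; i < parallelism.length; i++) {
            int p = parallelism[i];
            long totalMb = totalKbSent(serializedSizeKb[i], p) / 1000; // decimal units
            System.out.printf(
                    "parallelism=%d connections=%d totalSent=%dMB offload=%b%n",
                    p,
                    connections(p, p),
                    totalMb,
                    offloadToBlobServer(p, p, PROPOSED_CONNECTION_THRESHOLD));
        }
    }
}
```

Under these assumptions, the 5,000 * 5,000 case stays below the proposed default, while the 10,000 * 10,000 and 20,000 * 20,000 cases would be offloaded.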