[ https://issues.apache.org/jira/browse/FLINK-36112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
liang yu updated FLINK-36112: ----------------------------- Description: {*}Description{*}: I am currently using Apache Flink to write files into Hadoop. The Flink application runs on a labeled YARN queue. During operation, it has been observed that the local disks on these labeled nodes get filled up quickly, and the network load is significantly high. This issue arises because Hadoop prioritizes writing files to the local node first, and the number of these labeled nodes is quite limited. {*}Problem{*}: The current behavior leads to inefficient disk space utilization and high network traffic on these few labeled nodes, which could potentially affect the performance and reliability of the application. As shown in the picture, the host I circled have a average net_bytes_sent speed 1.2GB/s while the others are just 50MB/s, this imbalance in network and disk space nearly destroyed the whole cluster. !image-2024-08-20-18-51-11-864.png! {*}Implementation{*}: The implementation would involve adding a method of FileSystem.class to support the {{CreateFlag.NO_LOCAL_WRITE}} when we try to create a new file through HadoopFileSystem.create() API. What's more, I modify the code of FileSink class so that we can choose to enable no_local_write or disable this feature. This will provide flexibility to Flink running in labeled Yarn queues to opt for non-local writes when necessary. was: {*}Description{*}: I am currently using Apache Flink to write files into Hadoop. The Flink application runs on a labeled YARN queue. During operation, it has been observed that the local disks on these labeled nodes get filled up quickly, and the network load is significantly high. This issue arises because Hadoop prioritizes writing files to the local node first, and the number of these labeled nodes is quite limited. {*}Problem{*}: The current behavior leads to inefficient disk space utilization and high network traffic on these few labeled nodes, which could potentially affect the performance and reliability of the application. As shown in the picture, the host I circled have a average net_bytes_sent speed 1.2GB/s while the others are just 50MB/s, this imbalance in network and disk space nearly destroyed the whole cluster. !6D939050-0BC4-4B17-A6A3-A1EBBD60338D.png! {*}Implementation{*}: The implementation would involve adding a method of FileSystem.class to support the {{CreateFlag.NO_LOCAL_WRITE}} when we try to create a new file through HadoopFileSystem.create() API. What's more, I modify the code of FileSink class so that we can choose to enable no_local_write or disable this feature. This will provide flexibility to Flink running in labeled Yarn queues to opt for non-local writes when necessary. > Add Support for CreateFlag.NO_LOCAL_WRITE in FLINK on YARN's File Creation to > Manage Disk Space and Network Load in Labeled YARN Nodes > -------------------------------------------------------------------------------------------------------------------------------------- > > Key: FLINK-36112 > URL: https://issues.apache.org/jira/browse/FLINK-36112 > Project: Flink > Issue Type: Improvement > Reporter: liang yu > Priority: Major > Attachments: image-2024-08-20-18-51-11-864.png > > > {*}Description{*}: I am currently using Apache Flink to write files into > Hadoop. The Flink application runs on a labeled YARN queue. During operation, > it has been observed that the local disks on these labeled nodes get filled > up quickly, and the network load is significantly high. This issue arises > because Hadoop prioritizes writing files to the local node first, and the > number of these labeled nodes is quite limited. > > {*}Problem{*}: The current behavior leads to inefficient disk space > utilization and high network traffic on these few labeled nodes, which could > potentially affect the performance and reliability of the application. As > shown in the picture, the host I circled have a average net_bytes_sent speed > 1.2GB/s while the others are just 50MB/s, this imbalance in network and disk > space nearly destroyed the whole cluster. > > !image-2024-08-20-18-51-11-864.png! > > {*}Implementation{*}: The implementation would involve adding a method of > FileSystem.class to support the {{CreateFlag.NO_LOCAL_WRITE}} when we try to > create a new file through HadoopFileSystem.create() API. What's more, I > modify the code of FileSink class so that we can choose to enable > no_local_write or disable this feature. This will provide flexibility to > Flink running in labeled Yarn queues to opt for non-local writes when > necessary. > -- This message was sent by Atlassian Jira (v8.20.10#820010)