Simon-Shlomo Poil created FLINK-34696:
-----------------------------------------
             Summary: GSRecoverableWriterCommitter is generating excessive data blobs
                 Key: FLINK-34696
                 URL: https://issues.apache.org/jira/browse/FLINK-34696
             Project: Flink
          Issue Type: Bug
          Components: Connectors / FileSystem
            Reporter: Simon-Shlomo Poil


In the "composeBlobs" method of org.apache.flink.fs.gs.writer.GSRecoverableWriterCommitter, many small blobs are combined into a single final blob using the Google Storage compose method. The compose action is performed iteratively: each step composes the result of the previous step with up to 31 further blobs (Google Storage allows at most 32 source objects per compose call), until no blobs remain. The temporary blobs are removed only after the whole compose sequence has completed. This unfortunately leads to significant excess use of data storage (which for Google Storage is a rather costly situation).

*Simple example*

We have 64 blobs of 1 GB each, i.e. 64 GB in total.

1st step: 32 blobs are composed into one 32 GB blob; stored data is now 64 GB + 32 GB = 96 GB.
2nd step: the 32 GB blob from the previous step is composed with 31 further blobs into a 63 GB blob; stored data is now 64 GB + 32 GB + 63 GB = 159 GB.
3rd step: the last remaining blob is composed with the blob from the previous step into the final 64 GB blob; stored data is now 64 GB + 32 GB + 63 GB + 64 GB = 223 GB.

I.e. in order to combine 64 GB of data we had an overhead of 159 GB.

*Why is this a big issue?*

With large amounts of data the overhead becomes significant. With TiB of data we experienced storage peaks in the PiB range, leading to unexpectedly high costs and (maybe unrelated) frequent storage exceptions thrown by the Google Storage library.

*Suggested solution:*

When blobs have been composed into a new blob, the source blobs should be deleted immediately rather than at the very end, so the data is not kept in duplicate for the whole duration of the commit. Maybe this has implications for recoverability? A sketch of both behaviors follows below.
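To make the growth pattern and the proposed change concrete, here is a minimal, self-contained Java sketch against the google-cloud-storage client API. It is not the actual GSRecoverableWriterCommitter code: the class and method names, the ".tmp." naming of intermediate blobs, and the simplified control flow are all hypothetical, and the real committer has additional concerns (recoverable state, retries) that are omitted here. The first method mirrors the reported behavior (cleanup only at the very end); the second deletes each step's sources as soon as that step succeeds.

{code:java}
import java.util.ArrayList;
import java.util.List;

import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.Storage.ComposeRequest;

/**
 * Minimal sketch (not the actual Flink implementation) contrasting the
 * reported behavior with the suggested fix. All names are hypothetical.
 */
public final class ComposeOverheadSketch {

    /** Google Storage compose accepts at most 32 source objects per request. */
    private static final int MAX_SOURCES = 32;

    /**
     * Reported behavior: compose iteratively, keep every intermediate blob,
     * and delete sources and intermediates only after the final compose.
     * Peak storage is the input plus every intermediate blob, as in the
     * 64 GB -> 223 GB example above.
     */
    static void composeThenDeleteAtTheEnd(
            Storage storage, String bucket, List<String> components, String finalName) {
        List<String> temporaries = new ArrayList<>();
        String previous = null;
        int index = 0;
        int step = 0;
        while (index < components.size()) {
            List<String> sources = new ArrayList<>();
            if (previous != null) {
                sources.add(previous); // previous result, so only 31 new blobs fit
            }
            while (index < components.size() && sources.size() < MAX_SOURCES) {
                sources.add(components.get(index++));
            }
            boolean last = index >= components.size();
            String target = last ? finalName : finalName + ".tmp." + step++;
            ComposeRequest.Builder request =
                    ComposeRequest.newBuilder()
                            .setTarget(BlobInfo.newBuilder(bucket, target).build());
            sources.forEach(request::addSource);
            storage.compose(request.build());
            if (!last) {
                temporaries.add(target);
            }
            previous = target;
        }
        // Cleanup happens only here, after all compose steps have finished.
        for (String name : components) {
            storage.delete(BlobId.of(bucket, name));
        }
        for (String name : temporaries) {
            storage.delete(BlobId.of(bucket, name));
        }
    }

    /**
     * Suggested fix: delete the sources of each step as soon as that step
     * has succeeded, so each byte exists at most twice at any moment
     * (once in the sources being composed, once in the fresh result).
     */
    static void composeDeletingEagerly(
            Storage storage, String bucket, List<String> components, String finalName) {
        String previous = null;
        int index = 0;
        int step = 0;
        while (index < components.size()) {
            List<String> sources = new ArrayList<>();
            if (previous != null) {
                sources.add(previous);
            }
            while (index < components.size() && sources.size() < MAX_SOURCES) {
                sources.add(components.get(index++));
            }
            boolean last = index >= components.size();
            String target = last ? finalName : finalName + ".tmp." + step++;
            ComposeRequest.Builder request =
                    ComposeRequest.newBuilder()
                            .setTarget(BlobInfo.newBuilder(bucket, target).build());
            sources.forEach(request::addSource);
            storage.compose(request.build());
            // Eager cleanup: the new blob now holds this data, so the
            // sources (including the previous intermediate) can go.
            for (String name : sources) {
                storage.delete(BlobId.of(bucket, name));
            }
            previous = target;
        }
    }
}
{code}

With eager deletion every byte exists at most twice at any moment, so peak storage for the 64 GB example would be roughly 128 GB (during the final compose step) instead of 223 GB. The open question remains: once source blobs are deleted eagerly, a failed commit can no longer be restarted from the original component blobs, which is presumably the recoverability implication mentioned above.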
--
This message was sent by Atlassian Jira
(v8.20.10#820010)