Simon-Shlomo Poil created FLINK-34696:
-----------------------------------------

             Summary: GSRecoverableWriterCommitter is generating excessive data blobs
                 Key: FLINK-34696
                 URL: https://issues.apache.org/jira/browse/FLINK-34696
             Project: Flink
          Issue Type: Bug
          Components: Connectors / FileSystem
            Reporter: Simon-Shlomo Poil


In the "composeBlobs" method of
org.apache.flink.fs.gs.writer.GSRecoverableWriterCommitter, many small blobs
are combined into a single final blob using the Google Storage compose method.
This compose action is performed iteratively, each time composing the result
of the previous step with up to 31 new blobs, until no blobs remain. Only once
the whole compose sequence has completed are the temporary blobs removed.
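For illustration, here is a simplified sketch of that pattern, assuming the
google-cloud-storage Java client. Names such as composeIteratively are
illustrative; this is not the actual GSRecoverableWriterCommitter source.
{code:java}
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import java.util.ArrayList;
import java.util.List;

final class ComposeSketch {
    private static final int MAX_COMPOSE_SOURCES = 32; // GCS limit per compose call

    static void composeIteratively(Storage storage, String bucket,
                                   List<String> sources, String target) {
        List<String> remaining = new ArrayList<>(sources);
        boolean haveResult = false;
        while (!remaining.isEmpty()) {
            // First round takes up to 32 sources; later rounds take the
            // previous result plus up to 31 new sources.
            int take = Math.min(haveResult ? MAX_COMPOSE_SOURCES - 1
                                           : MAX_COMPOSE_SOURCES,
                                remaining.size());
            List<String> batch = new ArrayList<>();
            if (haveResult) {
                batch.add(target); // fold in the intermediate result so far
            }
            batch.addAll(remaining.subList(0, take));
            storage.compose(Storage.ComposeRequest.newBuilder()
                    .addSource(batch)
                    .setTarget(BlobInfo.newBuilder(bucket, target).build())
                    .build());
            remaining = remaining.subList(take, remaining.size());
            haveResult = true;
            // Note: no deletions here. The consumed source blobs (and, in the
            // real implementation, each intermediate blob) remain in storage
            // until a final cleanup pass after the whole loop has finished.
        }
    }
}
{code}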
 
Unfortunately, this leads to significant excess data storage (which on Google
Storage is rather costly).
 
*Simple example*
We have 64 blobs of 1 GB each, i.e. 64 GB of data.
1st step: the first 32 blobs are composed into one 32 GB blob; stored data is
now 64 GB + 32 GB = 96 GB.
2nd step: the 32 GB blob from the previous step is composed with 31 further
blobs into a 63 GB blob; stored data is now 64 GB + 32 GB + 63 GB = 159 GB.
3rd step: the last remaining blob is composed with the blob from the previous
step; stored data is now 64 GB + 32 GB + 63 GB + 64 GB = 223 GB.
I.e. to combine 64 GB of data we temporarily stored 223 GB, an overhead of
159 GB.
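This arithmetic generalizes to any number of blobs; a small self-contained
check (a hypothetical helper, not part of Flink) in "blob units" of 1 GB:
{code:java}
public final class ComposeOverhead {

    /** Peak storage, in blob units, when n unit-sized blobs are composed
     *  32-at-a-time first and then 31-at-a-time together with the running
     *  result, with cleanup deferred until the very end. */
    static long peakUnitsStored(int n) {
        int composed = Math.min(32, n); // first round folds up to 32 blobs
        long intermediates = composed;  // an intermediate blob of that size now exists
        int remaining = n - composed;
        while (remaining > 0) {
            int take = Math.min(31, remaining);
            composed += take;           // the new result covers `composed` originals
            intermediates += composed;  // ...and is written as yet another blob
            remaining -= take;
        }
        return n + intermediates;       // originals and all intermediates coexist
    }

    public static void main(String[] args) {
        // Prints 223, matching the example: 64 + 32 + 63 + 64.
        System.out.println(peakUnitsStored(64));
    }
}
{code}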
 
*Why is this a big issue?*
With large amounts of data the overhead becomes significant. With TiB of data
we experienced storage peaks in the PiB range, leading to unexpectedly high
costs and (perhaps unrelated) frequent storage exceptions thrown by the Google
Storage library.
 
*Suggested solution:*
Blobs should be deleted as soon as they have been composed into a larger blob,
so that data is never duplicated. One open question is whether this has
implications for recoverability.
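A minimal sketch of what eager cleanup could look like, again assuming the
google-cloud-storage Java client (illustrative only, not a patch): each round
deletes the source blobs it has just consumed, so storage holds at most the
not-yet-composed sources plus a single running result.
{code:java}
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import java.util.ArrayList;
import java.util.List;

final class EagerComposeSketch {
    static void composeWithEagerCleanup(Storage storage, String bucket,
                                        List<String> sources, String target) {
        List<String> remaining = new ArrayList<>(sources);
        boolean haveResult = false;
        while (!remaining.isEmpty()) {
            int take = Math.min(haveResult ? 31 : 32, remaining.size());
            List<String> batch = new ArrayList<>();
            if (haveResult) {
                batch.add(target); // running result from the previous round
            }
            batch.addAll(remaining.subList(0, take));
            storage.compose(Storage.ComposeRequest.newBuilder()
                    .addSource(batch)
                    .setTarget(BlobInfo.newBuilder(bucket, target).build())
                    .build());
            // Eager cleanup: the consumed sources are redundant once the
            // compose call has succeeded. (Whether deleting them this early
            // is compatible with the recoverable-writer contract is exactly
            // the open question raised above.)
            for (String name : batch) {
                if (!name.equals(target)) {
                    storage.delete(BlobId.of(bucket, name));
                }
            }
            remaining = remaining.subList(take, remaining.size());
            haveResult = true;
        }
    }
}
{code}
In the 64 GB example this should cap peak usage at roughly 2x the payload
(around 128 GB, during the later rounds) instead of the 223 GB seen with
deferred cleanup.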
 
 
 
 
 


