Yes, the issue is recoverability. However, there is one thing you can do: create a separate bucket in GCP for temporary files and initialize the filesystem configuration (via FileSystem.initialize) with *gs.writer.temporary.bucket.name* set to the name of that bucket. This will cause the GSRecoverableWriter to write intermediate/temporary files to that bucket instead of the "real" bucket. Then you can apply a TTL lifecycle policy <https://cloud.google.com/storage/docs/lifecycle> to the temporary bucket so that files are deleted after whatever TTL you want.
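Something like the following untested sketch, to make it concrete (the class name and "my-flink-tmp-bucket" are placeholders, 7 days is an arbitrary TTL, and whether you pass a PluginManager to FileSystem.initialize depends on how your deployment loads filesystems):

  import java.util.Collections;
  import org.apache.flink.configuration.Configuration;
  import org.apache.flink.core.fs.FileSystem;
  import com.google.cloud.storage.BucketInfo;
  import com.google.cloud.storage.BucketInfo.LifecycleRule;
  import com.google.cloud.storage.Storage;
  import com.google.cloud.storage.StorageOptions;

  public class TempBucketSetup {
    public static void main(String[] args) {
      // 1) Route the GS recoverable writer's in-progress/temporary files to a dedicated bucket.
      Configuration conf = new Configuration();
      conf.setString("gs.writer.temporary.bucket.name", "my-flink-tmp-bucket");
      // Do this before any gs:// paths are used; pass your PluginManager instead of null if you have one.
      FileSystem.initialize(conf, null);

      // 2) Put a delete-after-N-days lifecycle rule on that temporary bucket (7 days chosen arbitrarily).
      Storage storage = StorageOptions.getDefaultInstance().getService();
      LifecycleRule deleteAfter7Days = new LifecycleRule(
          LifecycleRule.LifecycleAction.newDeleteAction(),
          LifecycleRule.LifecycleCondition.newBuilder().setAge(7).build());
      storage.update(BucketInfo.newBuilder("my-flink-tmp-bucket")
          .setLifecycleRules(Collections.singletonList(deleteAfter7Days))
          .build());
    }
  }

You can of course also set the lifecycle rule once by hand in the Cloud Console instead of doing it in code.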
If you try to recover to a checkpoint/savepoint farther back in time than that TTL, the recovery will probably fail, but this lets you dial in whatever recoverability window you want, i.e. longer (at higher storage cost) or shorter (at lower storage cost).

On Fri, Mar 15, 2024 at 10:04 AM Simon-Shlomo Poil (Jira) <j...@apache.org> wrote:

> Simon-Shlomo Poil created FLINK-34696:
> -----------------------------------------
>
> Summary: GSRecoverableWriterCommitter is generating excessive data blobs
> Key: FLINK-34696
> URL: https://issues.apache.org/jira/browse/FLINK-34696
> Project: Flink
> Issue Type: Bug
> Components: Connectors / FileSystem
> Reporter: Simon-Shlomo Poil
>
> In the "composeBlobs" method of
> org.apache.flink.fs.gs.writer.GSRecoverableWriterCommitter, many small blobs
> are combined into a final single blob using the Google Storage compose
> method. This compose action is performed iteratively, each time composing
> the resulting blob from the previous step with 31 new blobs until there are
> no remaining blobs. When the compose action is completed, the temporary
> blobs are removed.
>
> This unfortunately leads to significant excess data storage (which for
> Google Storage is a rather costly situation).
>
> *Simple example*
> We have 64 blobs of 1 GB each, i.e. 64 GB.
> 1st step: 32 blobs are composed into one blob; i.e. now 64 GB + 32 GB = 96 GB
> 2nd step: The 32 GB blob from the previous step is composed with 31 blobs;
> now we have 64 GB + 32 GB + 63 GB = 159 GB
> 3rd step: The last remaining blob is composed with the blob from the
> previous step; now we have 64 GB + 32 GB + 63 GB + 64 GB = 223 GB
> I.e. in order to combine 64 GB of data we had an overhead of 159 GB.
>
> *Why is this a big issue?*
> With large amounts of data the overhead becomes significant. With TiB of
> data we experienced peaks of PiB, leading to unexpectedly high costs and
> (maybe unrelated) frequent storage exceptions thrown by the Google Storage
> library.
>
> *Suggested solution:*
> When the blobs are composed together, they should be deleted so that data
> is not duplicated.
> Maybe this has implications for recoverability?
>
> --
> This message was sent by Atlassian Jira
> (v8.20.10#820010)