Dear Flink community,

We have a use case where the StreamingFileSink <https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/streamfile_sink.html> is used to persist bulk-encoded data to AWS S3. In our case, the data sources consist of a mix of event types, and each type is uploaded to its own S3 prefix. Because the event sizes are highly skewed, the sizes of the uploaded files can differ dramatically. In order to have better control over the uploaded file size, we would like to adopt a rolling policy based on file size (e.g., roll the file every 100 MB). Yet it appears that the bulk-encoding StreamingFileSink only supports checkpoint-based file rolling.
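For reference, our current setup looks roughly like the sketch below. The `Event` class, the base path, and the `EventTypeBucketAssigner` are illustrative placeholders, and the Parquet/Avro writer simply stands in for the bulk writer we actually use:

import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

// 'events' is the DataStream<Event> produced upstream;
// EventTypeBucketAssigner is a hypothetical BucketAssigner<Event, String>
// that routes each event type to its own prefix under the base path.
StreamingFileSink<Event> sink = StreamingFileSink
        .forBulkFormat(
                new Path("s3://our-bucket/events/"),
                ParquetAvroWriters.forReflectRecord(Event.class))
        .withBucketAssigner(new EventTypeBucketAssigner())
        .build();

events.addSink(sink);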
Indeed, the documentation states: "IMPORTANT: Bulk-encoding formats can only be combined with the `OnCheckpointRollingPolicy`, which rolls the in-progress part file on every checkpoint."

Checkpoint-based file rolling also appears to have other side effects. For instance, quite a lot of the heavy lifting (e.g., uploading of file parts) is performed at checkpoint time. As a result, checkpointing takes longer when the data volume is high.

A customized file rolling policy could be supported with small adjustments to the BulkFormatBuilder interface in StreamingFileSink (a rough sketch of what we have in mind is attached as a P.S. below). When the S3RecoverableWriter is used, rolling a file triggers the upload of its data, and the corresponding S3Committer is constructed and stored as well. Hence, on the surface, adding a simple file-size-based rolling policy would NOT compromise the established exactly-once guarantee.

Any advice on whether the above idea makes sense? Or perhaps there are pitfalls one should pay attention to when introducing such a rolling policy?

Thanks a lot!
- Ying
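P.S. For concreteness, here is a rough sketch of the kind of policy we have in mind. The class name and the threshold are purely illustrative; the point is that it keeps rolling on every checkpoint exactly as OnCheckpointRollingPolicy does, and only adds a size check on top:

import java.io.IOException;

import org.apache.flink.streaming.api.functions.sink.filesystem.PartFileInfo;
import org.apache.flink.streaming.api.functions.sink.filesystem.RollingPolicy;

/**
 * Rolls on every checkpoint (same as OnCheckpointRollingPolicy, so the
 * S3RecoverableWriter / exactly-once machinery stays untouched) and
 * additionally rolls once the in-progress part file exceeds a size threshold.
 */
public class FileSizeAndCheckpointRollingPolicy<IN, BucketID>
        implements RollingPolicy<IN, BucketID> {

    private static final long serialVersionUID = 1L;

    private final long maxPartSizeBytes;

    public FileSizeAndCheckpointRollingPolicy(long maxPartSizeBytes) {
        this.maxPartSizeBytes = maxPartSizeBytes;
    }

    @Override
    public boolean shouldRollOnCheckpoint(PartFileInfo<BucketID> partFileState) {
        return true;  // unchanged: always roll on checkpoint
    }

    @Override
    public boolean shouldRollOnEvent(PartFileInfo<BucketID> partFileState, IN element)
            throws IOException {
        return partFileState.getSize() >= maxPartSizeBytes;  // new: size-based roll
    }

    @Override
    public boolean shouldRollOnProcessingTime(PartFileInfo<BucketID> partFileState, long currentTime) {
        return false;  // no time-based rolling for now
    }
}

The adjustment to BulkFormatBuilder would then essentially be a withRollingPolicy(...) hook analogous to the one the row-format builder already exposes, so that (hypothetically) one could write forBulkFormat(...).withRollingPolicy(new FileSizeAndCheckpointRollingPolicy<>(100L * 1024 * 1024)).build().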