Dear Flink community:

We have a use case where StreamingFileSink
<https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/streamfile_sink.html>
is used for persisting bulk-encoded data to AWS S3. In our case, the data
sources consist of heterogeneous event types, and each type is uploaded to
its own S3 prefix. Because event sizes are highly skewed, the uploaded file
sizes can differ dramatically. In order to have better control over the
uploaded file size, we would like to adopt a rolling policy based on file
size (e.g., roll the file every 100 MB). Yet it appears the bulk-encoding
StreamingFileSink only supports checkpoint-based file rolling:

IMPORTANT: Bulk-encoding formats can only be combined with the
`OnCheckpointRollingPolicy`, which rolls the in-progress part file on every
checkpoint.
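For reference, this is roughly how we construct the sink today (a minimal
sketch only; MyEvent, the bucket path, and the Parquet writer factory are
placeholders standing in for our actual setup):

import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

public class BulkSinkExample {

    // Hypothetical POJO standing in for one of our event types.
    public static class MyEvent {
        public String id;
        public long timestamp;
    }

    public static StreamingFileSink<MyEvent> buildSink() {
        // forBulkFormat() hard-wires OnCheckpointRollingPolicy, so in-progress
        // part files are rolled only when a checkpoint completes; the builder
        // currently offers no way to override the rolling policy.
        return StreamingFileSink
            .forBulkFormat(
                new Path("s3://my-bucket/events/"),   // placeholder bucket/prefix
                ParquetAvroWriters.forReflectRecord(MyEvent.class))
            .build();
    }
}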

Checkpoint-based file rolling also appears to have other side effects. For
instance, much of the heavy lifting (e.g., uploading file parts) is
performed at checkpointing time. As a result, checkpointing takes longer
when the data volume is high.

A customized file rolling policy can be achieved with small adjustments to
the BulkFormatBuilder interface in StreamingFileSink. When the
S3RecoverableWriter is used, rolling a file triggers the data upload, and the
corresponding S3Committer is constructed and stored. Hence, on the surface,
adding a simple file-size-based rolling policy would NOT compromise the
established exactly-once guarantee. A sketch of such a policy follows.
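To make the proposal concrete, the policy we have in mind would look roughly
like the sketch below. It only illustrates the intent and assumes the
BulkFormatBuilder were extended to accept such a policy, which is precisely
the adjustment being suggested:

import org.apache.flink.streaming.api.functions.sink.filesystem.PartFileInfo;
import org.apache.flink.streaming.api.functions.sink.filesystem.RollingPolicy;

import java.io.IOException;

/**
 * Sketch of a file-size based rolling policy for bulk-encoded formats.
 * Rolls the in-progress part file once it exceeds maxPartSizeBytes
 * (e.g., 100 MB), and still rolls on every checkpoint so pending parts
 * continue to be committed through the usual checkpoint path.
 */
public class FileSizeRollingPolicy<IN, BucketID> implements RollingPolicy<IN, BucketID> {

    private final long maxPartSizeBytes;

    public FileSizeRollingPolicy(long maxPartSizeBytes) {
        this.maxPartSizeBytes = maxPartSizeBytes;
    }

    @Override
    public boolean shouldRollOnCheckpoint(PartFileInfo<BucketID> partFileState) throws IOException {
        // Keep rolling on checkpoints so the exactly-once commit flow is unchanged.
        return true;
    }

    @Override
    public boolean shouldRollOnEvent(PartFileInfo<BucketID> partFileState, IN element) throws IOException {
        // Roll as soon as the in-progress part file reaches the size threshold.
        return partFileState.getSize() >= maxPartSizeBytes;
    }

    @Override
    public boolean shouldRollOnProcessingTime(PartFileInfo<BucketID> partFileState, long currentTime) throws IOException {
        return false;
    }
}

With that builder adjustment in place, the policy could then be plugged in
with something along the lines of
.withRollingPolicy(new FileSizeRollingPolicy<>(100L * 1024 * 1024)) on the
bulk-format builder (again, a proposed addition, not an existing method).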

Any advice on whether the above idea makes sense? Or perhaps there are
pitfalls one should pay attention to when introducing such a rolling policy.
Thanks a lot!


-
Ying
