Not an expert, but I would think this will not be trivial, since rolling on
checkpoints is what guarantees exactly-once semantics in the event of a
failure, and that guarantee is tightly integrated into the checkpointing
mechanism.  The precursor to the StreamingFileSink was the BucketingFileSink,
which I believe did give some control over when to roll, but it also suffered
from duplicate records in the files.  I vaguely recall reading a Flink blog
post on this, but can't find it right now.
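
To illustrate, here is a minimal sketch of the bulk-format builder being
discussed (assuming Flink 1.8 and the flink-parquet Avro writers; MyEvent,
stream, and the S3 path are placeholders, not anything from this thread).
The builder applies OnCheckpointRollingPolicy itself, so part files are only
finalized when a checkpoint completes:

import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

// Bulk-format (Parquet) sink: the builder exposes no withRollingPolicy(...)
// option, and part files are rolled only when a checkpoint completes.
StreamingFileSink<MyEvent> sink = StreamingFileSink
        .forBulkFormat(new Path("s3://my-bucket/output"),
                       ParquetAvroWriters.forReflectRecord(MyEvent.class))
        .build();

stream.addSink(sink);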

I sort of have the same need, but I worked around it by periodically merging
the Parquet files (doable as long as the schema is the same).  This runs out
of process, of course.
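
For what it's worth, that merge step is roughly like the sketch below
(assuming Avro-backed Parquet files that all share one schema; the class and
paths are made up for illustration, not my actual code):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;

import java.util.List;

public class ParquetMerge {

    // Copies every record from the small input files into one larger output file.
    public static void merge(List<Path> inputs, Path output, Schema schema) throws Exception {
        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(output)
                                  .withSchema(schema)
                                  .build()) {
            for (Path input : inputs) {
                try (ParquetReader<GenericRecord> reader =
                         AvroParquetReader.<GenericRecord>builder(input).build()) {
                    GenericRecord record;
                    while ((record = reader.read()) != null) {
                        writer.write(record);
                    }
                }
            }
        }
    }
}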

Tim

On Fri, May 31, 2019, 4:36 AM Ayush Verma <ayushver...@gmail.com> wrote:

> Hello,
>
> I am using the StreamingFileSink BulkFormat in my Flink stream processing
> job to output Parquet files to S3. The
> StreamingFileSink.BulkFormatBuilder
> <https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/StreamingFileSink.BulkFormatBuilder.html>
> does not have an option to set a custom rolling policy; it rolls the
> files whenever a checkpoint triggers
> <https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/rollingpolicies/OnCheckpointRollingPolicy.html>.
> It would be better to have a rolling policy based on both *size* and
> *time*. One option is to write our own version of the StreamingFileSink that
> does accept a custom rolling policy, but I suspect there is some reason
> for the current behaviour.
> I would like to get the opinion of Flink experts on this, and to hear about
> any potential workarounds to get the desired behaviour.
>
> Kind regards
> Ayush Verma
>

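For reference, the size-and-time rolling described above does exist in the
same API, but only on the row-encoded builder, not on the bulk/Parquet one
(a sketch assuming Flink 1.8; the path and thresholds are placeholders):

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

// Size- and time-based rolling, available only for row-encoded formats.
StreamingFileSink<String> rowSink = StreamingFileSink
        .forRowFormat(new Path("s3://my-bucket/output"),
                      new SimpleStringEncoder<String>("UTF-8"))
        .withRollingPolicy(DefaultRollingPolicy.create()
                .withMaxPartSize(128 * 1024 * 1024)      // roll at ~128 MB
                .withRolloverInterval(15 * 60 * 1000)    // or every 15 minutes
                .withInactivityInterval(5 * 60 * 1000)   // or after 5 min idle
                .build())
        .build();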