Not an expert, but I would think this will not be trivial, since rolling on checkpoints is what guarantees exactly-once semantics in the event of a failure, and the bulk-format sink is tightly integrated with the checkpointing mechanism. The precursor to the StreamingFileSink was the BucketingSink, which I believe gave some control over when to save, but it also suffered from duplicate records in the files. I vaguely recall reading a Flink blog post on this, but can't recall it right now.
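For reference, a rough sketch of the difference between the two builders as I read the 1.8 javadocs: forBulkFormat() always rolls on checkpoints, while forRowFormat() accepts a size/time based DefaultRollingPolicy. The ParquetAvroWriters factory, schema and S3 paths below are just placeholders, not from the original mail.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

Schema schema = ...; // placeholder Avro schema

// Bulk format (parquet): rolling is tied to checkpoints, no custom policy hook.
StreamingFileSink<GenericRecord> bulkSink = StreamingFileSink
    .forBulkFormat(new Path("s3://bucket/output"),
                   ParquetAvroWriters.forGenericRecord(schema))
    .build();

// Row format: a size/time rolling policy can be plugged in.
StreamingFileSink<String> rowSink = StreamingFileSink
    .forRowFormat(new Path("s3://bucket/output"),
                  new SimpleStringEncoder<String>("UTF-8"))
    .withRollingPolicy(DefaultRollingPolicy.create()
        .withMaxPartSize(128 * 1024 * 1024)      // roll at ~128 MB
        .withRolloverInterval(15 * 60 * 1000)    // or every 15 minutes
        .withInactivityInterval(5 * 60 * 1000)   // or after 5 minutes idle
        .build())
    .build();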
I sort of have the same desire, but I worked around it by periodically merging parquet files (doable as long as the schema is the same). This is out of process, of course.

Tim

On Fri, May 31, 2019, 4:36 AM Ayush Verma <ayushver...@gmail.com> wrote:

> Hello,
>
> I am using the StreamingFileSink BulkFormat in my Flink stream processing
> job to output parquet files to S3. The StreamingFileSink.BulkFormatBuilder
> <https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/StreamingFileSink.BulkFormatBuilder.html>
> does not have an option to set a custom rolling policy; it rolls the files
> whenever a checkpoint triggers
> <https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/rollingpolicies/OnCheckpointRollingPolicy.html>.
> It would be better to have a rolling policy based on both *size* and
> *time*. One option is to write our own StreamingFileSink, which does
> accept a custom rolling policy, but I suspect there might be some reason
> for this behaviour.
> I would like to get the opinion of Flink experts on this, and to hear
> about any potential workarounds to get the desired behaviour.
>
> Kind regards
> Ayush Verma
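For what it's worth, the out-of-process merge Tim mentions could look roughly like the sketch below, using parquet-avro's AvroParquetReader/AvroParquetWriter. The class name, paths and codec are made up for illustration, and it assumes every input part file shares the same Avro schema.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

import java.io.IOException;
import java.util.List;

public class ParquetMerge {

    // Copy all records from the small input files into one larger output file.
    public static void merge(List<Path> inputs, Path output, Schema schema) throws IOException {
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(output)
                .withSchema(schema)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
            for (Path input : inputs) {
                try (ParquetReader<GenericRecord> reader =
                         AvroParquetReader.<GenericRecord>builder(input).build()) {
                    GenericRecord record;
                    while ((record = reader.read()) != null) {
                        writer.write(record);
                    }
                }
            }
        }
    }
}

Run against the small checkpoint-rolled part files (e.g. from a cron job) and then swap the merged file in; just be careful not to merge files that the sink may still be writing to.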