Hi Averell,

You are right that for bulk formats like Parquet, we roll on every checkpoint. This is currently a limitation that has to do with the fact that bulk formats gather and rely on metadata that they keep internally and do not expose, so we cannot checkpoint it in Flink.

Setting the checkpoint interval affects how big your part files are going to be and, in some cases, how efficient your compression is going to be: often, the more data there is to compress, the better the compression ratio.

As for exposing withBucketCheckInterval(), you are right that it does not serve much purpose for the moment.

Cheers,
Kostas

> On Oct 5, 2018, at 1:54 AM, Averell <lvhu...@gmail.com> wrote:
>
> Hi Fabian, Kostas,
>
> From the description of this ticket
> https://issues.apache.org/jira/browse/FLINK-9753, I understand that now my
> output parquet file with StreamingFileSink will span multiple checkpoints.
> However, when I tried (as in the code snippet below) I still see that
> one "part-X-X" file is created after each checkpoint. Is there any other
> configuration that I'm missing?
>
> BTW, I have another question regarding
> StreamingFileSink.BulkFormatBuilder.withBucketCheckInterval(). As per the
> notes at the end of this page StreamingFileSink
> <https://ci.apache.org/projects/flink/flink-docs-master/dev/connectors/streamfile_sink.html>,
> bulk-encoding can only be combined with OnCheckpointRollingPolicy, which
> rolls on every checkpoint. So setting that CheckInterval makes no
> difference. So why should we expose that withBucketCheckInterval method?
>
> Thanks and best regards,
> Averell
>
> def buildSink[T <: MyBaseRecord](outputPath: String)(implicit ct:
>     ClassTag[T]): StreamingFileSink[T] = {
>   StreamingFileSink.forBulkFormat(new Path(outputPath),
>       ParquetAvroWriters.forReflectRecord(ct.runtimeClass))
>     .asInstanceOf[StreamingFileSink.BulkFormatBuilder[T, String]]
>     .withBucketCheckInterval(5L * 60L * 1000L)
>     .withBucketAssigner(new DateTimeBucketAssigner[T]("yyyy-MM-dd--HH"))
>     .build()
> }
>
> --
> Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
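P.S. In case it helps to make the relationship concrete: since bulk formats always use OnCheckpointRollingPolicy, the knob that actually bounds part-file size is the checkpoint interval on the environment. Below is a minimal sketch of that (not a definitive implementation: it assumes Flink 1.6+ with flink-parquet on the classpath, and the output path and the SensorReading class are made up for illustration).

```scala
import org.apache.flink.core.fs.Path
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// With bulk formats the sink rolls on every checkpoint, so the
// checkpoint interval is what governs part-file size: roughly one
// part file per bucket per checkpoint.
env.enableCheckpointing(5L * 60L * 1000L) // checkpoint (and roll) ~every 5 minutes

// SensorReading is a hypothetical Avro-reflect POJO, just for this sketch.
val sink: StreamingFileSink[SensorReading] =
  StreamingFileSink
    .forBulkFormat(
      new Path("hdfs:///tmp/parquet-out"), // hypothetical output path
      ParquetAvroWriters.forReflectRecord(classOf[SensorReading]))
    .build()
```

So withBucketCheckInterval() does not enter the picture here; tuning env.enableCheckpointing() is what trades off part-file size (and compression efficiency) against checkpointing frequency.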