[ https://issues.apache.org/jira/browse/FLINK-11499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072435#comment-17072435 ]
Jingsong Lee commented on FLINK-11499: -------------------------------------- I'm trying to think about using parquet/orc bulk writer to support binary roll policies: * Writer should provide "flush" to flush complete batchs to output stream. * Writer should has ability to snapshot its pending uncompleted batch to checkpoint state. It is best to snapshot the writer uncompleted batch data structures. Actually we should let bulk writer to be a stateful operator. It seems to be a little difficult to do. We may need to do more deep works into parquet/orc. So I am +1 for [~pnowojski]'s opinion generally. Although it brings latency, slow recovery and double writes. From my side, the core problem is double writers, it will bring per bucket per file plaintext storage and IO cost. [~pnowojski] Have you consider using checkpoint stream? I think the checkpoint state backend is the closest storage for job. > Extend StreamingFileSink BulkFormats to support arbitrary roll policies > ----------------------------------------------------------------------- > > Key: FLINK-11499 > URL: https://issues.apache.org/jira/browse/FLINK-11499 > Project: Flink > Issue Type: Improvement > Components: Connectors / FileSystem > Reporter: Seth Wiesman > Priority: Major > Labels: usability > Fix For: 1.11.0 > > > Currently when using the StreamingFilleSink Bulk-encoding formats can only be > combined with the `OnCheckpointRollingPolicy`, which rolls the in-progress > part file on every checkpoint. > However, many bulk formats such as parquet are most efficient when written as > large files; this is not possible when frequent checkpointing is enabled. > Currently the only work-around is to have long checkpoint intervals which is > not ideal. > > The StreamingFileSink should be enhanced to support arbitrary roll policy's > so users may write large bulk files while retaining frequent checkpoints. -- This message was sent by Atlassian Jira (v8.3.4#803005)