[ 
https://issues.apache.org/jira/browse/FLINK-11499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17070963#comment-17070963
 ] 

Piotr Nowojski commented on FLINK-11499:
----------------------------------------

[~zenfenan] no, sorry for not being clear. By
> we need to have a point in the past where both of the streams align, so 
> probably if we want to roll main output stream at most once per hour, we 
> still need to roll it on checkpoint.

I meant that once we decide to roll the main output stream (after one hour, for 
example), we still need to roll it together with the WAL stream, and both of 
those things will need to happen on a checkpoint. In other words, we probably 
cannot roll the main stream in the middle of a checkpoint. Or am I wrong?
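
As a rough illustration of this constraint, here is a minimal sketch. It is 
not Flink code: the class `AlignedRollSketch` and its methods are hypothetical 
names invented for the example, and it assumes that "in the middle of 
checkpoint" means "between two checkpoints". The main part file is only 
allowed to roll on a checkpoint, and only once the minimum interval (one hour 
in the example) has elapsed since the last roll, so that the main stream and 
the WAL stream always roll together at a checkpoint.

```java
import java.time.Duration;

/** Hypothetical model (not a Flink API) of "roll at most once per interval, but only on checkpoint". */
public class AlignedRollSketch {
    private final long minRollIntervalMs;
    private long lastRollTimeMs;

    public AlignedRollSketch(Duration minRollInterval, long startTimeMs) {
        this.minRollIntervalMs = minRollInterval.toMillis();
        this.lastRollTimeMs = startTimeMs;
    }

    /** Between checkpoints the main stream never rolls, regardless of elapsed time. */
    public boolean shouldRollOnEvent(long nowMs) {
        return false;
    }

    /**
     * On a checkpoint, roll the main stream (and, by assumption, the WAL stream
     * with it) only if the minimum interval has passed since the last roll.
     */
    public boolean shouldRollOnCheckpoint(long checkpointTimeMs) {
        if (checkpointTimeMs - lastRollTimeMs >= minRollIntervalMs) {
            lastRollTimeMs = checkpointTimeMs;
            return true;
        }
        return false;
    }
}
```

With a one-hour interval and checkpoints every few minutes, this would roll on 
the first checkpoint at or after the one-hour mark, so the file can cover 
slightly more than one hour of data.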

Re the latency, slow recovery and double writes: I get those points. In the 
end, I think I'm +1 for more specialised implementations for particular file 
formats, like Parquet and ORC, that could avoid those problems. I think they 
wouldn't contradict the WAL approach. We might have the WAL approach as a 
general solution for any bulk format, and we can also independently work on 
more specialised solutions.

Also, if you think that we can easily provide specialised support for both 
Parquet and ORC, we could deprioritise the WAL approach.

> Extend StreamingFileSink BulkFormats to support arbitrary roll policies
> -----------------------------------------------------------------------
>
>                 Key: FLINK-11499
>                 URL: https://issues.apache.org/jira/browse/FLINK-11499
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / FileSystem
>            Reporter: Seth Wiesman
>            Priority: Major
>              Labels: usability
>             Fix For: 1.11.0
>
>
> Currently, when using the StreamingFileSink, bulk-encoding formats can only 
> be combined with the `OnCheckpointRollingPolicy`, which rolls the 
> in-progress part file on every checkpoint.
> However, many bulk formats such as Parquet are most efficient when written as 
> large files; this is not possible when frequent checkpointing is enabled. 
> Currently the only workaround is to have long checkpoint intervals, which is 
> not ideal.
>  
> The StreamingFileSink should be enhanced to support arbitrary roll policies 
> so users may write large bulk files while retaining frequent checkpoints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
