[jira] [Updated] (FLINK-9609) Add bucket ready mechanism for BucketingSink when checkpoint complete

zhangminglei (JIRA) Sat, 21 Jul 2018 19:49:23 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


zhangminglei updated FLINK-9609:
--------------------------------
    Fix Version/s: 1.6.1

> Add bucket ready mechanism for BucketingSink when checkpoint complete
> ---------------------------------------------------------------------
>
>                 Key: FLINK-9609
>                 URL: https://issues.apache.org/jira/browse/FLINK-9609
>             Project: Flink
>          Issue Type: New Feature
>          Components: filesystem-connector, Streaming Connectors
>    Affects Versions: 1.5.0, 1.4.2
>            Reporter: zhangminglei
>            Assignee: zhangminglei
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.6.1
>
>
> Currently, BucketingSink only support {{notifyCheckpointComplete}}. However, 
> users want to do some extra work when a bucket is ready. It would be nice if 
> we can support {{BucketReady}} mechanism for users or we can tell users when 
> a bucket is ready for use. For example, One bucket is created for every 5 
> minutes, at the end of 5 minutes before creating the next bucket, the user 
> might need to do something as the previous bucket ready, like sending the 
> timestamp of the bucket ready time to a server or do some other stuff.
> Here, Bucket ready means all the part files suffix name under a bucket 
> neither {{.pending}} nor {{.in-progress}}. Then we can think this bucket is 
> ready for user use. Like a watermark means no elements with a timestamp older 
> or equal to the watermark timestamp should arrive at the window. We can also 
> refer to the concept of watermark here, or we can call this *BucketWatermark* 
> if we could.
> Recently, I found a user who wants this functionality which I would think.
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Let-BucketingSink-roll-file-on-each-checkpoint-td19034.html
> Below is what he said:
> My user case is we read data from message queue, write to HDFS, and our ETL 
> team will use the data in HDFS. *In the case, ETL need to know if all data is 
> ready to be read accurately*, so we use a counter to count how many data has 
> been wrote, if the counter is equal to the number we received, we think HDFS 
> file is ready. We send the counter message in a custom sink so ETL can know 
> how many data has been wrote, but if use current BucketingSink, even through 
> HDFS file is flushed, ETL may still cannot read the data. If we can close 
> file during checkpoint, then the result is accurately. And for the HDFS small 
> file problem, it can be controller by use bigger checkpoint interval. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (FLINK-9609) Add bucket ready mechanism for BucketingSink when checkpoint complete

Reply via email to