[ https://issues.apache.org/jira/browse/FLINK-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zhangminglei updated FLINK-9609: -------------------------------- Fix Version/s: 1.6.1 > Add bucket ready mechanism for BucketingSink when checkpoint complete > --------------------------------------------------------------------- > > Key: FLINK-9609 > URL: https://issues.apache.org/jira/browse/FLINK-9609 > Project: Flink > Issue Type: New Feature > Components: filesystem-connector, Streaming Connectors > Affects Versions: 1.5.0, 1.4.2 > Reporter: zhangminglei > Assignee: zhangminglei > Priority: Major > Labels: pull-request-available > Fix For: 1.6.1 > > > Currently, BucketingSink only support {{notifyCheckpointComplete}}. However, > users want to do some extra work when a bucket is ready. It would be nice if > we can support {{BucketReady}} mechanism for users or we can tell users when > a bucket is ready for use. For example, One bucket is created for every 5 > minutes, at the end of 5 minutes before creating the next bucket, the user > might need to do something as the previous bucket ready, like sending the > timestamp of the bucket ready time to a server or do some other stuff. > Here, Bucket ready means all the part files suffix name under a bucket > neither {{.pending}} nor {{.in-progress}}. Then we can think this bucket is > ready for user use. Like a watermark means no elements with a timestamp older > or equal to the watermark timestamp should arrive at the window. We can also > refer to the concept of watermark here, or we can call this *BucketWatermark* > if we could. > Recently, I found a user who wants this functionality which I would think. > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Let-BucketingSink-roll-file-on-each-checkpoint-td19034.html > Below is what he said: > My user case is we read data from message queue, write to HDFS, and our ETL > team will use the data in HDFS. *In the case, ETL need to know if all data is > ready to be read accurately*, so we use a counter to count how many data has > been wrote, if the counter is equal to the number we received, we think HDFS > file is ready. We send the counter message in a custom sink so ETL can know > how many data has been wrote, but if use current BucketingSink, even through > HDFS file is flushed, ETL may still cannot read the data. If we can close > file during checkpoint, then the result is accurately. And for the HDFS small > file problem, it can be controller by use bigger checkpoint interval. -- This message was sent by Atlassian JIRA (v7.6.3#76005)