In case of a failure, Flink rolls back the job to the last checkpoint and reprocesses all data since that checkpoint. The BucketingSink also truncates a file to the position of the last checkpoint if the file system supports truncation. If it does not, the sink writes a companion file recording the valid length and starts a new file.
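The practical consequence for downstream readers is that a bucket file may contain trailing bytes written after the last completed checkpoint. A minimal reader-side sketch of the valid-length convention described above (plain Java, no Flink dependency; the `.valid-length` suffix and the `ValidLengthReader` class are illustrative assumptions, not Flink API):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class ValidLengthReader {

    /**
     * Reads a bucket file, honoring an optional "<file>.valid-length"
     * companion file that holds the number of valid bytes as ASCII text.
     * Bytes past that length were written after the last completed
     * checkpoint and must be ignored.
     */
    public static byte[] readValid(Path file) throws IOException {
        byte[] all = Files.readAllBytes(file);
        Path marker = file.resolveSibling(file.getFileName() + ".valid-length");
        if (Files.exists(marker)) {
            long validLength = Long.parseLong(Files.readString(marker).trim());
            return Arrays.copyOf(all, (int) Math.min(validLength, all.length));
        }
        return all; // no marker: the whole file is valid
    }
}
```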
Therefore, all files that the BucketingSink finishes must be treated as volatile until the next checkpoint is completed. Only when a checkpoint completes may a finalized file be read. The files are renamed on checkpoint to signal that they are final and can be read. This would also be the right time to generate a _SUCCESS file. Have a look at the BucketingSink.notifyCheckpointComplete() method.

Best, Fabian

2018-02-05 6:43 GMT+01:00 xiaobin yan <yan.xiao.bin.m...@gmail.com>:

> Hi,
>
> I have tested it. There are some small problems. When the checkpoint
> finishes, the name of the file changes, and the _SUCCESS file is written
> before the checkpoint.
>
> Best,
> Ben
>
> On 1 Feb 2018, at 8:06 PM, Kien Truong <duckientru...@gmail.com> wrote:
>
> Hi,
>
> I did not actually test this, but I think with Flink 1.4 you can extend
> BucketingSink and override the invoke method to access the watermark.
>
> Pseudo code:
>
> invoke(IN value, SinkFunction.Context context) {
>     long currentWatermark = context.watermark();
>     long taskIndex = getRuntimeContext().getIndexOfThisSubtask();
>     if (taskIndex == 0 && currentWatermark - lastSuccessWatermark > 1 hour) {
>         // write _SUCCESS
>         lastSuccessWatermark = currentWatermark rounded down to 1 hour;
>     }
>     invoke(value);
> }
>
> Regards,
> Kien
>
> On 1/31/2018 5:54 PM, xiaobin yan wrote:
>
> Hi:
>
> I think so too! But I have a question: when should I add this logic in
> BucketingSink? And who executes this logic, and how do we ensure that it is
> executed only once, not by every parallel instance of the sink?
>
> Best,
> Ben
>
> On 31 Jan 2018, at 5:58 PM, Hung <unicorn.bana...@gmail.com> wrote:
>
> It depends on how you partition your file. In my case I write one file per
> hour, so I'm sure that the file is ready after that hour period, in
> processing time. Here, "ready" means the file contains all the data of that
> hour period.
> If the downstream runs in a batch way, you may want to ensure the file is
> ready. In this case, "ready to read" can mean that all the data before the
> watermark has arrived. You could take the BucketingSink and implement this
> logic there, maybe waiting until the watermark reaches the end of that
> period.
>
> Best,
>
> Sendoh
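Kien's pseudocode above can be fleshed out as plain Java, independent of Flink (the `SuccessMarkerPolicy` class and its method names are invented for illustration; in a real job the watermark would come from `SinkFunction.Context` and the subtask index from `getRuntimeContext().getIndexOfThisSubtask()`):

```java
import java.util.concurrent.TimeUnit;

/** Decides when a _SUCCESS marker should be (re)written, per Kien's sketch. */
public class SuccessMarkerPolicy {

    private static final long ONE_HOUR_MS = TimeUnit.HOURS.toMillis(1);

    private long lastSuccessWatermark = Long.MIN_VALUE;

    /**
     * Returns true if subtask 0 should write a _SUCCESS file now, i.e. the
     * watermark has advanced at least one hour past the last marker. Only
     * subtask 0 writes, so the marker is produced exactly once per period.
     */
    public boolean shouldWriteSuccess(int taskIndex, long currentWatermark) {
        if (taskIndex != 0) {
            return false; // other parallel instances never write the marker
        }
        if (lastSuccessWatermark == Long.MIN_VALUE
                || currentWatermark - lastSuccessWatermark >= ONE_HOUR_MS) {
            // round down to the hour boundary, as in the pseudocode
            lastSuccessWatermark = currentWatermark - (currentWatermark % ONE_HOUR_MS);
            return true;
        }
        return false;
    }

    public long lastSuccessWatermark() {
        return lastSuccessWatermark;
    }
}
```

Restricting the write to subtask index 0 is one answer to Ben's "executed only once" concern: every parallel sink instance runs the same code, but only one of them acts.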