Re: How does BucketingSink generate a SUCCESS file when a directory is finished

xiaobin yan Mon, 05 Feb 2018 19:15:05 -0800

Hi ,

        You've got a point. I saw that method, but how can I make sure that all 
the subtasks checkpoint are finished, because I can only write _SUCCESS file at 
that time.


Best,
Ben

> On 5 Feb 2018, at 6:34 PM, Fabian Hueske <[email protected]> wrote:
> 
> In case of a failure, Flink rolls back the job to the last checkpoint and 
> reprocesses all data since that checkpoint. 
> Also the BucketingSink will truncate a file to the position of the last 
> checkpoint if the file system supports truncate. If not, it writes a file 
> with the valid length and starts a new file.
> 
> Therefore, all files that the BucketingSink finishes must be treated as 
> volatile until the next checkpoint is completed. 
> Only when a checkpoint is completed a finalized file may be read. The files 
> are renamed on checkpoint to signal that they are final and can be read. This 
> would also be the right time to generate a _SUCCESS file.
> Have a look at the BucketingSink.notifyCheckpointComplete() method.
> 
> Best, Fabian
> 
> 
> 
> 
> 2018-02-05 6:43 GMT+01:00 xiaobin yan <[email protected] 
> <mailto:[email protected]>>:
> Hi ,
> 
>       I have tested it. There are some small problems. When checkpoint is 
> finished, the name of the file will change, and the success file will be 
> written before checkpoint.
> 
> Best,
> Ben
> 
> 
>> On 1 Feb 2018, at 8:06 PM, Kien Truong <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>> Hi,
>> 
>> I did not actually test this, but I think with Flink 1.4 you can extend 
>> BucketingSink and overwrite the invoke method to access the watermark
>> Pseudo code:
>> invoke(IN value, SinkFunction.Context context) {
>>    long currentWatermark = context.watermark()
>>    long taskIndex = getRuntimeContext().getIndexOfThisSubtask()
>>     if (taskIndex == 0 && currentWatermark - lastSuccessWatermark > 1 hour) {
>>        Write _SUCCESS
>>        lastSuccessWatermark = currentWatermark round down to 1 hour
>>     }
>>     invoke(value)
>> }
>> 
>> Regards,
>> Kien
>> On 1/31/2018 5:54 PM, xiaobin yan wrote:
>>> Hi:
>>> 
>>> I think so too! But I have a question that when should I add this logic in 
>>> BucketingSink! And who does this logic, and ensures that the logic is 
>>> executed only once, not every parallel instance of the sink that executes 
>>> this logic!
>>> 
>>> Best,
>>> Ben
>>> 
>>>> On 31 Jan 2018, at 5:58 PM, Hung <[email protected]> 
>>>> <mailto:[email protected]> wrote:
>>>> 
>>>> it depends on how you partition your file. in my case I write file per 
>>>> hour,
>>>> so I'm sure that file is ready after that hour period, in processing time.
>>>> Here, read to be ready means this file contains all the data in that hour
>>>> period.
>>>> 
>>>> If the downstream runs in a batch way, you may want to ensure the file is
>>>> ready.
>>>> In this case, ready to read can mean all the data before watermark as
>>>> arrived.
>>>> You could take the BucketingSink and implement this logic there, maybe wait
>>>> until watermark
>>>> reaches
>>>> 
>>>> Best,
>>>> 
>>>> Sendoh
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Sent from: 
>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ 
>>>> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/>
> 
>

Re: How does BucketingSink generate a SUCCESS file when a directory is finished

Reply via email to