Hi Li,

This is the expected behavior. All the "exactly-once" sinks in Flink require checkpointing to be enabled: rolling a part file only moves it to the "pending" state, and pending files (the open multipart uploads you are seeing on S3) are only committed when the next checkpoint completes. We will update the documentation to be clearer in the upcoming release.
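For reference, a rough sketch of the wiring on 1.9 (the bucket path, encoder, and intervals below are placeholders, not taken from your job; the explicit rolling policy just spells out the row-format defaults):

    import java.util.concurrent.TimeUnit;
    import org.apache.flink.api.common.serialization.SimpleStringEncoder;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
    import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Without this, rolled part files stay "pending" (open multipart
    // uploads on S3) and are never committed.
    env.enableCheckpointing(TimeUnit.MINUTES.toMillis(1));

    StreamingFileSink<String> sink = StreamingFileSink
            .forRowFormat(new Path("s3://my-bucket/output"),   // placeholder bucket/path
                          new SimpleStringEncoder<String>("UTF-8"))
            .withRollingPolicy(DefaultRollingPolicy.create()
                    .withRolloverInterval(TimeUnit.MINUTES.toMillis(1))  // roll at most every minute
                    .withInactivityInterval(TimeUnit.MINUTES.toMillis(1))
                    .withMaxPartSize(128 * 1024 * 1024)
                    .build())
            .build();

With this in place, each checkpoint completes the pending multipart uploads, so you should see finished part files appear in S3 roughly once per checkpoint interval rather than having to tune the Hadoop buffer settings.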
Thanks a lot,
Kostas

On Sat, Dec 7, 2019 at 3:47 AM Li Peng <li.p...@doordash.com> wrote:
>
> Ok, I seem to have solved the issue by enabling checkpointing. Based on the
> docs (I'm using 1.9.0), it seemed like only StreamingFileSink.forBulkFormat()
> should've required checkpointing, but based on this experience,
> StreamingFileSink.forRowFormat() requires it too! Is this the intended
> behavior? If so, the docs should probably be updated.
>
> Thanks,
> Li
>
> On Fri, Dec 6, 2019 at 2:01 PM Li Peng <li.p...@doordash.com> wrote:
>>
>> Hey folks, I'm trying to get StreamingFileSink to write to s3 every minute,
>> with flink-s3-fs-hadoop, and based on the default rolling policy, which is
>> configured to "roll" every 60 seconds, I thought that would be automatic (I
>> interpreted rolling to mean actually closing a multipart upload to s3).
>>
>> But I'm not actually seeing any files written to s3 at all. Instead, I see
>> a bunch of open multipart uploads when I check the AWS s3 console, for
>> example:
>>
>> "Uploads": [
>>     {
>>         "Initiated": "2019-12-06T20:57:47.000Z",
>>         "Key": "2019-12-06--20/part-0-0"
>>     },
>>     {
>>         "Initiated": "2019-12-06T20:57:47.000Z",
>>         "Key": "2019-12-06--20/part-1-0"
>>     },
>>     {
>>         "Initiated": "2019-12-06T21:03:12.000Z",
>>         "Key": "2019-12-06--21/part-0-1"
>>     },
>>     {
>>         "Initiated": "2019-12-06T21:04:15.000Z",
>>         "Key": "2019-12-06--21/part-0-2"
>>     },
>>     {
>>         "Initiated": "2019-12-06T21:22:23.000Z",
>>         "Key": "2019-12-06--21/part-0-3"
>>     }
>> ]
>>
>> These uploads stay open for a long time; after an hour, none of them have
>> been closed. Is this the expected behavior? If I wanted to get these
>> uploads to actually write to s3 quickly, do I need to configure the Hadoop
>> filesystem to get that done, like setting a smaller buffer/partition size
>> to force it to upload?
>>
>> Thanks,
>> Li