Thanks, Gary! Sure, there are issues with updates in S3. You may want to look over the consistent view guarantees that EMRFS provides [1]; I'm not sure whether something equivalent is possible on AWS outside of EMR.
I'm creating a JIRA issue regarding the possibility of data loss in S3. IMHO, the Flink docs should mention that data loss is possible with S3.

[1] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-consistent-view.html

--
Thanks,
Amit

On Fri, May 18, 2018 at 2:48 AM, Gary Yao <g...@data-artisans.com> wrote:
> Hi Amit,
>
> The BucketingSink doesn't have well-defined semantics when used with S3. Data
> loss is possible, but I am not sure whether it is the only problem. There are
> plans to rewrite the BucketingSink in Flink 1.6 to enable eventually consistent
> file systems [1][2].
>
> Best,
> Gary
>
> [1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/sink-with-BucketingSink-to-S3-files-override-td18433.html
> [2] https://issues.apache.org/jira/browse/FLINK-6306
>
> On Thu, May 17, 2018 at 11:57 AM, Amit Jain <aj201...@gmail.com> wrote:
>>
>> Hi,
>>
>> We are using Flink to process click-stream data from Kafka and push it to
>> S3 in 128 MB files.
>>
>> What are the message processing guarantees with the S3 sink? In my
>> understanding, the S3A client buffers the data in memory/on disk. In a
>> failure scenario on a particular node, the TM would not trigger
>> Writer#close, hence the buffered data can be lost entirely, assuming this
>> buffer contains data from the last successful checkpoint.
>>
>> --
>> Thanks,
>> Amit
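
For context, a minimal sketch of the kind of BucketingSink-to-S3 job described in the quoted question; the bucket path, source, bucketer, and checkpoint interval are illustrative assumptions, not the actual job:

    // Sketch: stream records into ~128 MB part files on S3 via the s3a:// scheme.
    // Bucket path, source, and checkpoint interval are placeholders.
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.fs.StringWriter;
    import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
    import org.apache.flink.streaming.connectors.fs.bucketing.DateTimeBucketer;

    public class ClickStreamToS3Job {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(60_000); // checkpoint every minute

            // Placeholder source; in the scenario above this would be a Kafka consumer.
            DataStream<String> clicks = env.socketTextStream("localhost", 9999);

            BucketingSink<String> sink = new BucketingSink<>("s3a://my-bucket/clicks");
            sink.setBucketer(new DateTimeBucketer<String>("yyyy-MM-dd--HH"));
            sink.setWriter(new StringWriter<String>());
            sink.setBatchSize(128L * 1024 * 1024); // roll part files at ~128 MB

            clicks.addSink(sink);
            env.execute("Click stream to S3");
        }
    }

With the S3A client, part-file contents may sit in a local buffer until the writer is closed or flushed, which is what makes the failure scenario described above possible.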