By setting

    properties.setProperty("batch.size", "10240000");
    properties.setProperty("linger.ms", "10000");

in the properties passed to FlinkKafkaProducer010 (to postpone automatic flushing), and then killing (kill -9 PID) the YarnTaskManager process in the middle of executing a Flink job. Records that had not yet been flushed automatically were thus lost (this at-least-once bug was about the missing manual/explicit flush on a checkpoint).

Piotrek
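For reference, a repro job along those lines could look roughly like the sketch below. This is only an illustration: the broker address, topic name, and dummy source are assumptions, not taken from this thread; the classes are the ones shipped with the Flink 1.3.x Kafka 0.10 connector.

import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class AtLeastOnceRepro {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5000); // checkpoint every 5 seconds

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "broker-1:9092"); // assumed broker address
        // A huge batch size and a long linger delay postpone the Kafka client's
        // automatic flushing, so records sit in the client-side buffers longer.
        properties.setProperty("batch.size", "10240000");
        properties.setProperty("linger.ms", "10000");

        // Dummy source that emits records continuously (illustrative only).
        DataStream<String> records = env.addSource(new SourceFunction<String>() {
            private volatile boolean running = true;

            @Override
            public void run(SourceFunction.SourceContext<String> ctx) throws Exception {
                long i = 0;
                while (running) {
                    ctx.collect("record-" + i++);
                    Thread.sleep(1);
                }
            }

            @Override
            public void cancel() {
                running = false;
            }
        });

        records.addSink(new FlinkKafkaProducer010<String>(
                "test-topic", new SimpleStringSchema(), properties));

        env.execute("at-least-once repro");
    }
}

While such a job runs, killing the YarnTaskManager with kill -9 between two checkpoints loses whatever is still sitting in the Kafka client buffers, even though a checkpoint covering those records may already have completed.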
> On Jul 26, 2017, at 1:19 PM, Aljoscha Krettek <aljos...@apache.org> wrote:
>
> Sweet (maybe?)! How did you reproduce data-loss?
>
> Best,
> Aljoscha
>
>> On 26. Jul 2017, at 11:13, Piotr Nowojski <pi...@data-artisans.com> wrote:
>>
>> It took me longer than I expected, but I was able to reproduce data loss
>> with older Flink versions while running Flink in a 3-node cluster. I have
>> also validated that the at-least-once semantic is fixed for Kafka 0.10 in
>> Flink 1.3-SNAPSHOT.
>>
>> Piotrek
>>
>>> On Jul 20, 2017, at 4:52 PM, Stephan Ewen <se...@apache.org> wrote:
>>>
>>> Thank you very much for driving this!
>>>
>>> On Thu, Jul 20, 2017 at 9:09 AM, Piotr Nowojski <pi...@data-artisans.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Regarding the Kafka at-least-once bug: I could try to play with Flink
>>>> 1.3.1 on a real cluster to provoke this bug, by basically repeating
>>>> KafkaProducerTestBase#testOneToOneAtLeastOnce on a larger scale.
>>>>
>>>> Piotrek
>>>>
>>>>> On Jul 19, 2017, at 5:26 PM, Aljoscha Krettek <aljos...@apache.org> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Yes! In my opinion, the most critical issues are these:
>>>>>
>>>>> - https://issues.apache.org/jira/browse/FLINK-6964: Fix recovery for
>>>>>   incremental checkpoints in StandaloneCompletedCheckpointStore
>>>>> - https://issues.apache.org/jira/browse/FLINK-7041: Deserialize
>>>>>   StateBackend from JobCheckpointingSettings with user classloader
>>>>>
>>>>> The first one makes incremental checkpoints on RocksDB unusable with
>>>>> externalised checkpoints. The latter means that you cannot have custom
>>>>> configuration of the RocksDB backend.
>>>>>
>>>>> - https://issues.apache.org/jira/browse/FLINK-7216: ExecutionGraph can
>>>>>   perform concurrent global restarts to scheduling
>>>>> - https://issues.apache.org/jira/browse/FLINK-7153: Eager Scheduling
>>>>>   can't allocate source for ExecutionGraph correctly
>>>>>
>>>>> These are critical scheduler bugs; Stephan can probably say more about
>>>>> them than I can.
>>>>>
>>>>> - https://issues.apache.org/jira/browse/FLINK-7143: Partition assignment
>>>>>   for Kafka consumer is not stable
>>>>> - https://issues.apache.org/jira/browse/FLINK-7195: FlinkKafkaConsumer
>>>>>   should not respect fetched partitions to filter restored partition
>>>>>   states
>>>>> - https://issues.apache.org/jira/browse/FLINK-6996: FlinkKafkaProducer010
>>>>>   doesn't guarantee at-least-once semantic
>>>>>
>>>>> The first one means that you can have duplicate data because several
>>>>> consumers would be consuming from one partition without noticing it. The
>>>>> second one causes partitions to be dropped from state if a broker is
>>>>> temporarily not reachable.
>>>>>
>>>>> The first two issues would have been caught by my proposed testing
>>>>> procedures; the third and fourth might be caught but are very tricky to
>>>>> provoke. I'm currently applying this testing procedure to Flink 1.3.1 to
>>>>> see if I can provoke them.
>>>>>
>>>>> The Kafka bugs are super hard to provoke because they only occur if
>>>>> Kafka has some temporary problems or there are communication problems.
>>>>>
>>>>> I forgot to mention that I actually have two goals with this: 1)
>>>>> thoroughly test Flink and 2) build expertise in the community, i.e.
>>>>> we're forced to try cluster environments/distributions that we are not
>>>>> familiar with and we actually deploy a full job and play around with it.
>>>>>
>>>>> Best,
>>>>> Aljoscha
>>>>>
>>>>>> On 19. Jul 2017, at 15:49, Shaoxuan Wang <shaox...@apache.org> wrote:
>>>>>>
>>>>>> Hi Aljoscha,
>>>>>> Glad to see that we have a more thorough testing procedure. Could you
>>>>>> please share with us what (the critical issues you mentioned) has been
>>>>>> broken in 1.3.0 & 1.3.1, and how the newly proposed "functional testing
>>>>>> section and a combination of systems/configurations" can cover this.
>>>>>> This will help us to improve our production verification as well.
>>>>>>
>>>>>> Regards,
>>>>>> Shaoxuan
>>>>>>
>>>>>> On Wed, Jul 19, 2017 at 9:11 PM, Aljoscha Krettek <aljos...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Everyone,
>>>>>>>
>>>>>>> We are on the verge of starting the release process for Flink 1.3.2,
>>>>>>> and there have been some critical issues in both Flink 1.3.0 and
>>>>>>> 1.3.1. For Flink 1.3.2 I want to make very sure that we test as much
>>>>>>> as possible. For this I'm proposing a slightly changed testing
>>>>>>> procedure [1]. This is similar to the testing document we used for
>>>>>>> previous releases but has a new functional testing section that tries
>>>>>>> to outline a testing procedure and a combination of
>>>>>>> systems/configurations that we have to test. Please have a look and
>>>>>>> comment on whether you think this is sufficient (or a bit too much).
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>> Best,
>>>>>>> Aljoscha
>>>>>>>
>>>>>>> [1] https://docs.google.com/document/d/16fU1cpxoYf3o9cCDyakj7ZDnUoJTj4_CEmMTpCkY81s/edit?usp=sharing
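For context on FLINK-6996 discussed above (FlinkKafkaProducer010 not guaranteeing at-least-once), the missing piece was an explicit flush when a checkpoint is taken. Below is a minimal, hand-rolled illustration of that flush-on-checkpoint pattern; it is not the actual FlinkKafkaProducer010 code, the class name is made up, and producerConfig is assumed to carry bootstrap.servers plus key/value serializer settings.

import java.util.Properties;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FlushingKafkaSink extends RichSinkFunction<String> implements CheckpointedFunction {

    private final String topic;
    private final Properties producerConfig; // must include bootstrap.servers and key/value serializers

    private transient KafkaProducer<String, String> producer;

    public FlushingKafkaSink(String topic, Properties producerConfig) {
        this.topic = topic;
        this.producerConfig = producerConfig;
    }

    @Override
    public void open(Configuration parameters) {
        producer = new KafkaProducer<>(producerConfig);
    }

    @Override
    public void invoke(String value) {
        // Asynchronous send: the record may still sit in the client-side buffers
        // (see batch.size / linger.ms above) after this call returns.
        producer.send(new ProducerRecord<>(topic, value));
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) {
        // The at-least-once guarantee hinges on this explicit flush: the
        // checkpoint must not complete while records are still buffered.
        producer.flush();
    }

    @Override
    public void initializeState(FunctionInitializationContext context) {
        // No state to restore; the flush-on-checkpoint hook is all we need here.
    }

    @Override
    public void close() {
        if (producer != null) {
            producer.close();
        }
    }
}

Without the flush() in snapshotState(), a checkpoint can complete while records that logically precede it exist only in the client buffers; a kill -9 of the TaskManager in that window (as in the repro above) then loses them silently.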