Sweet (maybe?)! How did you reproduce the data loss?

Best,
Aljoscha
> On 26. Jul 2017, at 11:13, Piotr Nowojski <pi...@data-artisans.com> wrote:
>
> It took me longer than I expected, but I was able to reproduce data loss
> with older Flink versions while running Flink in a 3-node cluster. I have
> also validated that the at-least-once semantic is fixed for Kafka 0.10 in
> Flink 1.3-SNAPSHOT.
>
> Piotrek
>
>> On Jul 20, 2017, at 4:52 PM, Stephan Ewen <se...@apache.org> wrote:
>>
>> Thank you very much for driving this!
>>
>> On Thu, Jul 20, 2017 at 9:09 AM, Piotr Nowojski <pi...@data-artisans.com> wrote:
>>
>>> Hi,
>>>
>>> Regarding the Kafka at-least-once bug: I could try to play with Flink
>>> 1.3.1 on a real cluster to provoke this bug, by basically repeating
>>> KafkaProducerTestBase#testOneToOneAtLeastOnce on a larger scale.
>>>
>>> Piotrek
>>>
>>>> On Jul 19, 2017, at 5:26 PM, Aljoscha Krettek <aljos...@apache.org> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Yes! In my opinion, the most critical issues are these:
>>>>
>>>> - https://issues.apache.org/jira/browse/FLINK-6964: Fix recovery for
>>>>   incremental checkpoints in StandaloneCompletedCheckpointStore
>>>> - https://issues.apache.org/jira/browse/FLINK-7041: Deserialize
>>>>   StateBackend from JobCheckpointingSettings with user classloader
>>>>
>>>> The first one makes incremental checkpoints on RocksDB unusable with
>>>> externalised checkpoints. The latter means that you cannot have custom
>>>> configuration of the RocksDB backend.
>>>> - https://issues.apache.org/jira/browse/FLINK-7216: ExecutionGraph can
>>>>   perform concurrent global restarts to scheduling
>>>> - https://issues.apache.org/jira/browse/FLINK-7153: Eager Scheduling
>>>>   can't allocate source for ExecutionGraph correctly
>>>>
>>>> These are critical scheduler bugs; Stephan can probably say more about
>>>> them than I can.
>>>>
>>>> - https://issues.apache.org/jira/browse/FLINK-7143: Partition assignment
>>>>   for Kafka consumer is not stable
>>>> - https://issues.apache.org/jira/browse/FLINK-7195: FlinkKafkaConsumer
>>>>   should not respect fetched partitions to filter restored partition
>>>>   states
>>>> - https://issues.apache.org/jira/browse/FLINK-6996: FlinkKafkaProducer010
>>>>   doesn't guarantee at-least-once semantic
>>>>
>>>> The first one means that you can have duplicate data, because several
>>>> consumers would be consuming from one partition without noticing it. The
>>>> second one causes partitions to be dropped from state if a broker is
>>>> temporarily not reachable.
>>>>
>>>> The first two issues would have been caught by my proposed testing
>>>> procedure; the third and fourth might be caught, but they are very
>>>> tricky to provoke. I’m currently applying this testing procedure to
>>>> Flink 1.3.1 to see if I can provoke them.
>>>>
>>>> The Kafka bugs are super hard to provoke because they only occur if
>>>> Kafka has some temporary problems or there are communication problems.
>>>>
>>>> I forgot to mention that I actually have two goals with this: 1)
>>>> thoroughly test Flink, and 2) build expertise in the community, i.e.
>>>> we’re forced to try cluster environments/distributions that we are not
>>>> familiar with, and we actually deploy a full job and play around with it.
>>>>
>>>> Best,
>>>> Aljoscha
>>>>
>>>>
>>>>> On 19. Jul 2017, at 15:49, Shaoxuan Wang <shaox...@apache.org> wrote:
>>>>>
>>>>> Hi Aljoscha,
>>>>> Glad to see that we have a more thorough testing procedure. Could you
>>>>> please share with us what (the critical issues you mentioned) has been
>>>>> broken in 1.3.0 & 1.3.1, and how the newly proposed "functional testing
>>>>> section and a combination of systems/configurations" can cover this?
>>>>> This will help us to improve our production verification as well.
>>>>>
>>>>> Regards,
>>>>> Shaoxuan
>>>>>
>>>>>
>>>>> On Wed, Jul 19, 2017 at 9:11 PM, Aljoscha Krettek <aljos...@apache.org> wrote:
>>>>>
>>>>>> Hi Everyone,
>>>>>>
>>>>>> We are on the verge of starting the release process for Flink 1.3.2,
>>>>>> and there have been some critical issues in both Flink 1.3.0 and
>>>>>> 1.3.1. For Flink 1.3.2 I want to make very sure that we test as much
>>>>>> as possible. For this I’m proposing a slightly changed testing
>>>>>> procedure [1]. This is similar to the testing document we used for
>>>>>> previous releases, but it has a new functional testing section that
>>>>>> tries to outline a testing procedure and a combination of
>>>>>> systems/configurations that we have to test. Please have a look and
>>>>>> comment on whether you think this is sufficient (or a bit too much).
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>> Best,
>>>>>> Aljoscha
>>>>>>
>>>>>> [1] https://docs.google.com/document/d/16fU1cpxoYf3o9cCDyakj7ZDnUoJTj4_CEmMTpCkY81s/edit?usp=sharing
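[Editor's note: FLINK-7143, discussed in the thread above, is about unstable partition-to-subtask assignment in the Kafka consumer: if subtasks independently compute different assignments, one partition can end up being read by several consumers, producing duplicates. The idea behind a stable scheme is that every subtask deterministically derives the same global mapping. A minimal sketch of that idea (class and method names are hypothetical, not Flink's actual code):]

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of deterministic partition-to-subtask assignment.
// Every subtask computes the same mapping from (numPartitions, numSubtasks),
// so no two subtasks can ever claim the same partition.
public class PartitionAssignmentSketch {

    // Returns the partition indices that the given subtask should read.
    static List<Integer> assign(int numPartitions, int numSubtasks, int subtaskIndex) {
        List<Integer> mine = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            if (p % numSubtasks == subtaskIndex) {
                mine.add(p);
            }
        }
        return mine;
    }

    public static void main(String[] args) {
        int partitions = 8, subtasks = 3;
        for (int s = 0; s < subtasks; s++) {
            // Each subtask gets a disjoint slice; together they cover all partitions.
            System.out.println("subtask " + s + " -> " + assign(partitions, subtasks, s));
        }
        // prints:
        // subtask 0 -> [0, 3, 6]
        // subtask 1 -> [1, 4, 7]
        // subtask 2 -> [2, 5]
    }
}
```

[Because the function depends only on stable inputs, the assignment also survives restarts, which is the property the bug report says was violated.]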
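[Editor's note: FLINK-6996, also listed above, is the at-least-once violation in FlinkKafkaProducer010: a checkpoint could complete while records were still in flight to Kafka, so a failure after that checkpoint silently dropped them. The fix direction discussed for the connector is to flush pending records before a snapshot completes. A toy model of that invariant, with entirely hypothetical names and a synchronous stand-in for the asynchronous Kafka client:]

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the at-least-once invariant: a sink must not confirm a
// checkpoint while records are still in flight. Names are illustrative,
// not Flink's actual classes.
public class AtLeastOnceSinkSketch {

    final List<String> inFlight = new ArrayList<>();  // handed to the client, not yet acked
    final List<String> delivered = new ArrayList<>(); // acknowledged by the broker

    void send(String record) {
        inFlight.add(record); // asynchronous send in the real producer
    }

    // Called when a checkpoint is taken: drain every pending record before
    // returning, otherwise a post-checkpoint failure would lose them.
    void snapshotState() {
        while (!inFlight.isEmpty()) {
            delivered.add(inFlight.remove(0)); // stand-in for awaiting the ack
        }
    }

    public static void main(String[] args) {
        AtLeastOnceSinkSketch sink = new AtLeastOnceSinkSketch();
        sink.send("a");
        sink.send("b");
        sink.snapshotState(); // flush before the checkpoint completes
        System.out.println(sink.delivered); // [a, b]
        System.out.println(sink.inFlight);  // []
    }
}
```

[After `snapshotState()` returns, nothing is in flight, so replaying from this checkpoint can duplicate records but never lose them; that is exactly the at-least-once guarantee the thread says was being violated.]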