It took me longer than I expected, but I was able to reproduce the data loss with older Flink versions while running Flink on a 3-node cluster. I have also validated that the at-least-once semantic is fixed for Kafka 0.10 in Flink 1.3-SNAPSHOT.
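For reference, the producer-side switch this validation revolves around is flush-on-checkpoint (the subject of FLINK-6996). Below is a minimal sketch of such a validation job; the broker addresses, topic name, and record count are assumptions, and the failure injection plus the consumer-side gap check that an actual run needs are left out.

import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class AtLeastOnceValidationJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Flushing on checkpoint only has an effect if checkpointing is enabled.
        env.enableCheckpointing(1000);

        Properties props = new Properties();
        // Assumed broker addresses of the test cluster.
        props.setProperty("bootstrap.servers", "broker-1:9092,broker-2:9092,broker-3:9092");

        // Simple numbered records so a consuming side can detect gaps.
        DataStream<String> records = env.generateSequence(0, 10_000_000)
                .map(new MapFunction<Long, String>() {
                    @Override
                    public String map(Long value) {
                        return String.valueOf(value);
                    }
                });

        FlinkKafkaProducer010<String> producer =
                new FlinkKafkaProducer010<>("at-least-once-test", new SimpleStringSchema(), props);
        // At-least-once configuration: flush pending records on checkpoint
        // and fail the job (instead of only logging) when a send fails.
        producer.setFlushOnCheckpoint(true);
        producer.setLogFailuresOnly(false);

        records.addSink(producer);
        env.execute("Kafka 0.10 at-least-once validation");
    }
}

The intent of setFlushOnCheckpoint(true) together with setLogFailuresOnly(false) is that a checkpoint only completes once its records have been acknowledged by Kafka, which is the guarantee being verified on the 3-node cluster.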
Piotrek

> On Jul 20, 2017, at 4:52 PM, Stephan Ewen <se...@apache.org> wrote:
> 
> Thank you very much for driving this!
> 
> On Thu, Jul 20, 2017 at 9:09 AM, Piotr Nowojski <pi...@data-artisans.com>
> wrote:
> 
>> Hi,
>> 
>> Regarding the Kafka at-least-once bug: I could try to play with Flink 1.3.1
>> on a real cluster to provoke it, basically by repeating
>> KafkaProducerTestBase#testOneToOneAtLeastOnce on a larger scale.
>> 
>> Piotrek
>> 
>>> On Jul 19, 2017, at 5:26 PM, Aljoscha Krettek <aljos...@apache.org> wrote:
>>> 
>>> Hi,
>>> 
>>> Yes! In my opinion, the most critical issues are these:
>>> 
>>> - https://issues.apache.org/jira/browse/FLINK-6964: Fix recovery for
>>>   incremental checkpoints in StandaloneCompletedCheckpointStore
>>> - https://issues.apache.org/jira/browse/FLINK-7041: Deserialize StateBackend
>>>   from JobCheckpointingSettings with user classloader
>>> 
>>> The first one makes incremental checkpoints on RocksDB unusable with
>>> externalised checkpoints. The latter means that you cannot have custom
>>> configuration of the RocksDB backend.
>>> 
>>> - https://issues.apache.org/jira/browse/FLINK-7216: ExecutionGraph can
>>>   perform concurrent global restarts to scheduling
>>> - https://issues.apache.org/jira/browse/FLINK-7153: Eager Scheduling can't
>>>   allocate source for ExecutionGraph correctly
>>> 
>>> These are critical scheduler bugs; Stephan can probably say more about
>>> them than I can.
>>> 
>>> - https://issues.apache.org/jira/browse/FLINK-7143: Partition assignment
>>>   for Kafka consumer is not stable
>>> - https://issues.apache.org/jira/browse/FLINK-7195: FlinkKafkaConsumer
>>>   should not respect fetched partitions to filter restored partition states
>>> - https://issues.apache.org/jira/browse/FLINK-6996: FlinkKafkaProducer010
>>>   doesn't guarantee at-least-once semantic
>>> 
>>> The first one means that you can have duplicate data because several
>>> consumers would be consuming from one partition without noticing it. The
>>> second one causes partitions to be dropped from state if a broker is
>>> temporarily not reachable.
>>> 
>>> The first two issues would have been caught by my proposed testing
>>> procedures; the third and fourth might be caught, but they are very tricky
>>> to provoke. I'm currently applying this testing procedure to Flink 1.3.1
>>> to see if I can provoke them.
>>> 
>>> The Kafka bugs are super hard to provoke because they only occur if Kafka
>>> has some temporary problems or there are communication problems.
>>> 
>>> I forgot to mention that I actually have two goals with this: 1) thoroughly
>>> test Flink and 2) build expertise in the community, i.e. we're forced to
>>> try cluster environments/distributions that we are not familiar with, and
>>> we actually deploy a full job and play around with it.
>>> 
>>> Best,
>>> Aljoscha
>>> 
>>> 
>>>> On 19. Jul 2017, at 15:49, Shaoxuan Wang <shaox...@apache.org> wrote:
>>>> 
>>>> Hi Aljoscha,
>>>> Glad to see that we have a more thorough testing procedure. Could you
>>>> please share with us which critical issues were broken in 1.3.0 & 1.3.1,
>>>> and how the newly proposed "functional testing section and a combination
>>>> of systems/configurations" can cover them. This will help us improve our
>>>> production verification as well.
>>>> 
>>>> Regards,
>>>> Shaoxuan
>>>> 
>>>> 
>>>> On Wed, Jul 19, 2017 at 9:11 PM, Aljoscha Krettek <aljos...@apache.org>
>>>> wrote:
>>>> 
>>>>> Hi Everyone,
>>>>> 
>>>>> We are on the verge of starting the release process for Flink 1.3.2, and
>>>>> there have been some critical issues in both Flink 1.3.0 and 1.3.1. For
>>>>> Flink 1.3.2 I want to make very sure that we test as much as possible.
>>>>> For this I'm proposing a slightly changed testing procedure [1]. It is
>>>>> similar to the testing document we used for previous releases, but it has
>>>>> a new functional testing section that tries to outline a testing
>>>>> procedure and a combination of systems/configurations that we have to
>>>>> test. Please have a look and comment on whether you think this is
>>>>> sufficient (or a bit too much).
>>>>> 
>>>>> What do you think?
>>>>> 
>>>>> Best,
>>>>> Aljoscha
>>>>> 
>>>>> [1] https://docs.google.com/document/d/16fU1cpxoYf3o9cCDyakj7ZDnUoJTj4_CEmMTpCkY81s/edit?usp=sharing
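A side note on FLINK-6964 and FLINK-7041 from the list above: both concern jobs that configure the RocksDB backend in user code and rely on incremental, externalised checkpoints for recovery. A rough sketch of that configuration follows; the HDFS checkpoint URI and the placeholder pipeline are assumptions, not part of any test mentioned in this thread.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalCheckpointConfigSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000);

        // RocksDB backend with incremental checkpoints (second constructor argument).
        // The HDFS URI is an assumption; any DFS path reachable by the cluster works.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));

        // Externalised checkpoints: retain checkpoint metadata so the job can be
        // resumed from it after a cancellation or a failure.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // Placeholder pipeline; a real test job would use keyed state so that
        // RocksDB actually accumulates data to checkpoint incrementally.
        env.generateSequence(0, Long.MAX_VALUE)
                .map(new MapFunction<Long, Long>() {
                    @Override
                    public Long map(Long value) {
                        return value;
                    }
                })
                .print();

        env.execute("incremental + externalised checkpoint test job");
    }
}

Per the descriptions above, FLINK-6964 is what breaks recovery from such incremental checkpoints, and FLINK-7041 is what prevents a backend configured in user code like this from being deserialised with the user classloader.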