Hi,

Regarding the Kafka at-least-once bug: I could try to play with Flink 1.3.1 on a real cluster to provoke it, basically by repeating KafkaProducerTestBase#testOneToOneAtLeastOnce on a larger scale.

Piotrek

> On Jul 19, 2017, at 5:26 PM, Aljoscha Krettek <aljos...@apache.org> wrote:
>
> Hi,
>
> Yes! In my opinion, the most critical issues are these:
>
> - https://issues.apache.org/jira/browse/FLINK-6964: Fix recovery for
>   incremental checkpoints in StandaloneCompletedCheckpointStore
> - https://issues.apache.org/jira/browse/FLINK-7041: Deserialize StateBackend
>   from JobCheckpointingSettings with user classloader
>
> The first one makes incremental checkpoints on RocksDB unusable with
> externalised checkpoints. The latter means that you cannot have custom
> configuration of the RocksDB backend.
>
> - https://issues.apache.org/jira/browse/FLINK-7216: ExecutionGraph can
>   perform concurrent global restarts to scheduling
> - https://issues.apache.org/jira/browse/FLINK-7153: Eager Scheduling can't
>   allocate source for ExecutionGraph correctly
>
> These are critical scheduler bugs, Stephan can probably say more about them
> than I can.
>
> - https://issues.apache.org/jira/browse/FLINK-7143: Partition assignment for
>   Kafka consumer is not stable
> - https://issues.apache.org/jira/browse/FLINK-7195: FlinkKafkaConsumer should
>   not respect fetched partitions to filter restored partition states
> - https://issues.apache.org/jira/browse/FLINK-6996: FlinkKafkaProducer010
>   doesn't guarantee at-least-once semantic
>
> The first one means that you can have duplicate data because several
> consumers would be consuming from one partition, without noticing it. The
> second one causes partitions to be dropped from state if a broker is
> temporarily not reachable.
>
> The first two issues would have been caught by my proposed testing
> procedures, the third and fourth might be caught but are very tricky to
> provoke. I’m currently experimenting with applying this testing procedure to
> Flink 1.3.1 to see if I can provoke them.
>
> The Kafka bugs are super hard to provoke because they only occur if Kafka has
> some temporary problems or there are communication problems.
>
> I forgot to mention that I actually have two goals with this: 1) thoroughly
> test Flink and 2) build expertise in the community, i.e. we’re forced to try
> cluster environments/distributions that we are not familiar with and we
> actually deploy a full job and play around with it.
>
> Best,
> Aljoscha
>
>> On 19. Jul 2017, at 15:49, Shaoxuan Wang <shaox...@apache.org> wrote:
>>
>> Hi Aljoscha,
>> Glad to see that we have a more thorough testing procedure. Could you
>> please share with us what (the critical issues you mentioned) has been broken
>> in 1.3.0 & 1.3.1, and how the new proposed "functional testing section and
>> a combination of systems/configurations" can cover this? This will help us
>> to improve our production verification as well.
>>
>> Regards,
>> Shaoxuan
>>
>> On Wed, Jul 19, 2017 at 9:11 PM, Aljoscha Krettek <aljos...@apache.org>
>> wrote:
>>
>>> Hi Everyone,
>>>
>>> We are on the verge of starting the release process for Flink 1.3.2 and
>>> there have been some critical issues in both Flink 1.3.0 and 1.3.1. For
>>> Flink 1.3.2 I want to make very sure that we test as much as possible. For
>>> this I’m proposing a slightly changed testing procedure [1]. This is
>>> similar to the testing document we used for previous releases but has a new
>>> functional testing section that tries to outline a testing procedure and a
>>> combination of systems/configurations that we have to test. Please have a
>>> look and comment on whether you think this is sufficient (or a bit too
>>> much).
>>>
>>> What do you think?
>>>
>>> Best,
>>> Aljoscha
>>>
>>> [1] https://docs.google.com/document/d/16fU1cpxoYf3o9cCDyakj7ZDnUoJTj4_CEmMTpCkY81s/edit?usp=sharing