Thank you very much for driving this!

On Thu, Jul 20, 2017 at 9:09 AM, Piotr Nowojski <pi...@data-artisans.com> wrote:
> Hi,
>
> Regarding the Kafka at-least-once bug: I could try to play with Flink 1.3.1
> on a real cluster to provoke this bug, by basically repeating
> KafkaProducerTestBase#testOneToOneAtLeastOnce on a larger scale.
>
> Piotrek
>
> > On Jul 19, 2017, at 5:26 PM, Aljoscha Krettek <aljos...@apache.org> wrote:
> >
> > Hi,
> >
> > Yes! In my opinion, the most critical issues are these:
> >
> > - https://issues.apache.org/jira/browse/FLINK-6964: Fix recovery for
> >   incremental checkpoints in StandaloneCompletedCheckpointStore
> > - https://issues.apache.org/jira/browse/FLINK-7041: Deserialize
> >   StateBackend from JobCheckpointingSettings with user classloader
> >
> > The first one makes incremental checkpoints on RocksDB unusable with
> > externalised checkpoints. The latter means that you cannot have custom
> > configuration of the RocksDB backend.
> >
> > - https://issues.apache.org/jira/browse/FLINK-7216: ExecutionGraph can
> >   perform concurrent global restarts to scheduling
> > - https://issues.apache.org/jira/browse/FLINK-7153: Eager Scheduling can't
> >   allocate source for ExecutionGraph correctly
> >
> > These are critical scheduler bugs; Stephan can probably say more about
> > them than I can.
> >
> > - https://issues.apache.org/jira/browse/FLINK-7143: Partition assignment
> >   for Kafka consumer is not stable
> > - https://issues.apache.org/jira/browse/FLINK-7195: FlinkKafkaConsumer
> >   should not respect fetched partitions to filter restored partition states
> > - https://issues.apache.org/jira/browse/FLINK-6996: FlinkKafkaProducer010
> >   doesn't guarantee at-least-once semantic
> >
> > The first one means that you can have duplicate data because several
> > consumers would be consuming from one partition, without noticing it. The
> > second one causes partitions to be dropped from state if a broker is
> > temporarily not reachable.
> >
> > The first two issues would have been caught by my proposed testing
> > procedures, the third and fourth might be caught but are very tricky to
> > provoke. I’m currently experimenting with applying this testing procedure
> > to Flink 1.3.1 to see if I can provoke it.
> >
> > The Kafka bugs are super hard to provoke because they only occur if
> > Kafka has some temporary problems or there are communication problems.
> >
> > I forgot to mention that I actually have two goals with this: 1)
> > thoroughly test Flink and 2) build expertise in the community, i.e. we’re
> > forced to try cluster environments/distributions that we are not familiar
> > with and we actually deploy a full job and play around with it.
> >
> > Best,
> > Aljoscha
> >
> >> On 19. Jul 2017, at 15:49, Shaoxuan Wang <shaox...@apache.org> wrote:
> >>
> >> Hi Aljoscha,
> >> Glad to see that we have a more thorough testing procedure. Could you
> >> please share with us what (the critical issues you mentioned) has been
> >> broken in 1.3.0 & 1.3.1, and how the newly proposed "functional testing
> >> section and a combination of systems/configurations" can cover this.
> >> This will help us to improve our production verification as well.
> >>
> >> Regards,
> >> Shaoxuan
> >>
> >> On Wed, Jul 19, 2017 at 9:11 PM, Aljoscha Krettek <aljos...@apache.org> wrote:
> >>
> >>> Hi Everyone,
> >>>
> >>> We are on the verge of starting the release process for Flink 1.3.2 and
> >>> there have been some critical issues in both Flink 1.3.0 and 1.3.1. For
> >>> Flink 1.3.2 I want to make very sure that we test as much as possible.
> >>> For this I’m proposing a slightly changed testing procedure [1]. This is
> >>> similar to the testing document we used for previous releases but has a
> >>> new functional testing section that tries to outline a testing procedure
> >>> and a combination of systems/configurations that we have to test. Please
> >>> have a look and comment on whether you think this is sufficient (or a bit
> >>> too much).
> >>>
> >>> What do you think?
> >>>
> >>> Best,
> >>> Aljoscha
> >>>
> >>> [1] https://docs.google.com/document/d/16fU1cpxoYf3o9cCDyakj7ZDnUoJTj4_CEmMTpCkY81s/edit?usp=sharing