Re: [DISCUSS] Release testing procedures, Flink 1.3.2

Stephan Ewen Thu, 20 Jul 2017 07:53:15 -0700

Thank you very much for driving this!

On Thu, Jul 20, 2017 at 9:09 AM, Piotr Nowojski <pi...@data-artisans.com>
wrote:


> Hi,
>
> Regarding the Kafka at-least-once bug: I could try running Flink 1.3.1 on
> a real cluster to provoke it, basically by repeating
> KafkaProducerTestBase#testOneToOneAtLeastOnce on a larger scale.
>
> Piotrek
>
> > On Jul 19, 2017, at 5:26 PM, Aljoscha Krettek <aljos...@apache.org> wrote:
> >
> > Hi,
> >
> > Yes! In my opinion, the most critical issues are these:
> >
> > - https://issues.apache.org/jira/browse/FLINK-6964: Fix recovery for
> > incremental checkpoints in StandaloneCompletedCheckpointStore
> > - https://issues.apache.org/jira/browse/FLINK-7041: Deserialize
> > StateBackend from JobCheckpointingSettings with user classloader
> >
> > The first one makes incremental checkpoints on RocksDB unusable with
> > externalised checkpoints. The latter means that you cannot use a custom
> > configuration of the RocksDB backend.
> >
> > - https://issues.apache.org/jira/browse/FLINK-7216: ExecutionGraph can
> > perform concurrent global restarts to scheduling
> > - https://issues.apache.org/jira/browse/FLINK-7153: Eager Scheduling can't
> > allocate source for ExecutionGraph correctly
> >
> > These are critical scheduler bugs; Stephan can probably say more about
> > them than I can.
> >
> > - https://issues.apache.org/jira/browse/FLINK-7143: Partition assignment
> > for Kafka consumer is not stable
> > - https://issues.apache.org/jira/browse/FLINK-7195: FlinkKafkaConsumer
> > should not respect fetched partitions to filter restored partition states
> > - https://issues.apache.org/jira/browse/FLINK-6996: FlinkKafkaProducer010
> > doesn't guarantee at-least-once semantic
> >
> > The first one means that you can have duplicate data because several
> > consumers would be consuming from one partition without noticing it. The
> > second one causes partitions to be dropped from state if a broker is
> > temporarily not reachable.
> >
> > The first two issues would have been caught by my proposed testing
> > procedures; the third and fourth might be caught but are very tricky to
> > provoke. I’m currently applying this testing procedure to Flink 1.3.1 to
> > see if I can provoke them.
> >
> > The Kafka bugs are super hard to provoke because they only occur if
> > Kafka has some temporary problems or there are communication problems.
> >
> > I forgot to mention that I actually have two goals with this: 1)
> > thoroughly test Flink and 2) build expertise in the community, i.e. we’re
> > forced to try cluster environments/distributions that we are not familiar
> > with and to actually deploy a full job and play around with it.
> >
> > Best,
> > Aljoscha
> >
> >
> >> On 19. Jul 2017, at 15:49, Shaoxuan Wang <shaox...@apache.org> wrote:
> >>
> >> Hi Aljoscha,
> >> Glad to see that we have a more thorough testing procedure. Could you
> >> please share with us what was broken in 1.3.0 & 1.3.1 (the critical
> >> issues you mentioned), and how the newly proposed "functional testing
> >> section and a combination of systems/configurations" can cover this?
> >> This will help us to improve our production verification as well.
> >>
> >> Regards,
> >> Shaoxuan
> >>
> >>
> >> On Wed, Jul 19, 2017 at 9:11 PM, Aljoscha Krettek <aljos...@apache.org>
> >> wrote:
> >>
> >>> Hi Everyone,
> >>>
> >>> We are on the verge of starting the release process for Flink 1.3.2 and
> >>> there have been some critical issues in both Flink 1.3.0 and 1.3.1. For
> >>> Flink 1.3.2 I want to make very sure that we test as much as possible.
> >>> For this I’m proposing a slightly changed testing procedure [1]. This is
> >>> similar to the testing document we used for previous releases but has a
> >>> new functional testing section that tries to outline a testing procedure
> >>> and a combination of systems/configurations that we have to test. Please
> >>> have a look and comment on whether you think this is sufficient (or a bit
> >>> too much).
> >>>
> >>> What do you think?
> >>>
> >>> Best,
> >>> Aljoscha
> >>>
> >>> [1] https://docs.google.com/document/d/16fU1cpxoYf3o9cCDyakj7ZDnUoJTj4_CEmMTpCkY81s/edit?usp=sharing
> >
>
>
