It took me longer than I expected, but I was able to reproduce the data loss with older Flink versions while running Flink on a 3-node cluster. I have also validated that the at-least-once semantic is fixed for Kafka 0.10 in Flink 1.3-SNAPSHOT.
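For reference, the producer-side switch this validation revolves around is flush-on-checkpoint (the subject of FLINK-6996). Below is a minimal sketch of such a validation job; the broker addresses, topic name, and record count are assumptions, and the failure injection plus the consumer-side gap check that an actual run needs are left out.

import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class AtLeastOnceValidationJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Flushing on checkpoint only has an effect if checkpointing is enabled.
        env.enableCheckpointing(1000);

        Properties props = new Properties();
        // Assumed broker addresses of the test cluster.
        props.setProperty("bootstrap.servers", "broker-1:9092,broker-2:9092,broker-3:9092");

        // Simple numbered records so a consuming side can detect gaps.
        DataStream<String> records = env.generateSequence(0, 10_000_000)
                .map(new MapFunction<Long, String>() {
                    @Override
                    public String map(Long value) {
                        return String.valueOf(value);
                    }
                });

        FlinkKafkaProducer010<String> producer =
                new FlinkKafkaProducer010<>("at-least-once-test", new SimpleStringSchema(), props);
        // At-least-once configuration: flush pending records on checkpoint
        // and fail the job (instead of only logging) when a send fails.
        producer.setFlushOnCheckpoint(true);
        producer.setLogFailuresOnly(false);

        records.addSink(producer);
        env.execute("Kafka 0.10 at-least-once validation");
    }
}

The intent of setFlushOnCheckpoint(true) together with setLogFailuresOnly(false) is that a checkpoint only completes once its records have been acknowledged by Kafka, which is the guarantee being verified on the 3-node cluster.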
Piotrek

> On Jul 20, 2017, at 4:52 PM, Stephan Ewen <se...@apache.org> wrote:
> 
> Thank you very much for driving this!
> 
> On Thu, Jul 20, 2017 at 9:09 AM, Piotr Nowojski <pi...@data-artisans.com>
> wrote:
> 
>> Hi,
>> 
>> Regarding the Kafka at-least-once bug: I could try to play with Flink 1.3.1
>> on a real cluster to provoke it, basically by repeating
>> KafkaProducerTestBase#testOneToOneAtLeastOnce on a larger scale.
>> 
>> Piotrek
>> 
>>> On Jul 19, 2017, at 5:26 PM, Aljoscha Krettek <aljos...@apache.org> wrote:
>>> 
>>> Hi,
>>> 
>>> Yes! In my opinion, the most critical issues are these:
>>> 
>>> - https://issues.apache.org/jira/browse/FLINK-6964: Fix recovery for
>>>   incremental checkpoints in StandaloneCompletedCheckpointStore
>>> - https://issues.apache.org/jira/browse/FLINK-7041: Deserialize StateBackend
>>>   from JobCheckpointingSettings with user classloader
>>> 
>>> The first one makes incremental checkpoints on RocksDB unusable with
>>> externalised checkpoints. The latter means that you cannot have custom
>>> configuration of the RocksDB backend.
>>> 
>>> - https://issues.apache.org/jira/browse/FLINK-7216: ExecutionGraph can
>>>   perform concurrent global restarts to scheduling
>>> - https://issues.apache.org/jira/browse/FLINK-7153: Eager Scheduling can't
>>>   allocate source for ExecutionGraph correctly
>>> 
>>> These are critical scheduler bugs; Stephan can probably say more about
>>> them than I can.
>>> 
>>> - https://issues.apache.org/jira/browse/FLINK-7143: Partition assignment
>>>   for Kafka consumer is not stable
>>> - https://issues.apache.org/jira/browse/FLINK-7195: FlinkKafkaConsumer
>>>   should not respect fetched partitions to filter restored partition states
>>> - https://issues.apache.org/jira/browse/FLINK-6996: FlinkKafkaProducer010
>>>   doesn't guarantee at-least-once semantic
>>> 
>>> The first one means that you can have duplicate data because several
>>> consumers would be consuming from one partition without noticing it. The
>>> second one causes partitions to be dropped from state if a broker is
>>> temporarily not reachable.
>>> 
>>> The first two issues would have been caught by my proposed testing
>>> procedures; the third and fourth might be caught, but they are very tricky
>>> to provoke. I'm currently applying this testing procedure to Flink 1.3.1
>>> to see if I can provoke them.
>>> 
>>> The Kafka bugs are super hard to provoke because they only occur if Kafka
>>> has some temporary problems or there are communication problems.
>>> 
>>> I forgot to mention that I actually have two goals with this: 1) thoroughly
>>> test Flink and 2) build expertise in the community, i.e. we're forced to
>>> try cluster environments/distributions that we are not familiar with, and
>>> we actually deploy a full job and play around with it.
>>> 
>>> Best,
>>> Aljoscha
>>> 
>>> 
>>>> On 19. Jul 2017, at 15:49, Shaoxuan Wang <shaox...@apache.org> wrote:
>>>> 
>>>> Hi Aljoscha,
>>>> Glad to see that we have a more thorough testing procedure. Could you
>>>> please share with us which critical issues were broken in 1.3.0 & 1.3.1,
>>>> and how the newly proposed "functional testing section and a combination
>>>> of systems/configurations" can cover them. This will help us improve our
>>>> production verification as well.
>>>> 
>>>> Regards,
>>>> Shaoxuan
>>>> 
>>>> 
>>>> On Wed, Jul 19, 2017 at 9:11 PM, Aljoscha Krettek <aljos...@apache.org>
>>>> wrote:
>>>> 
>>>>> Hi Everyone,
>>>>> 
>>>>> We are on the verge of starting the release process for Flink 1.3.2, and
>>>>> there have been some critical issues in both Flink 1.3.0 and 1.3.1. For
>>>>> Flink 1.3.2 I want to make very sure that we test as much as possible.
>>>>> For this I'm proposing a slightly changed testing procedure [1]. It is
>>>>> similar to the testing document we used for previous releases, but it has
>>>>> a new functional testing section that tries to outline a testing
>>>>> procedure and a combination of systems/configurations that we have to
>>>>> test. Please have a look and comment on whether you think this is
>>>>> sufficient (or a bit too much).
>>>>> 
>>>>> What do you think?
>>>>> 
>>>>> Best,
>>>>> Aljoscha
>>>>> 
>>>>> [1] https://docs.google.com/document/d/16fU1cpxoYf3o9cCDyakj7ZDnUoJTj4_CEmMTpCkY81s/edit?usp=sharing
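A side note on FLINK-6964 and FLINK-7041 from the list above: both concern jobs that configure the RocksDB backend in user code and rely on incremental, externalised checkpoints for recovery. A rough sketch of that configuration follows; the HDFS checkpoint URI and the placeholder pipeline are assumptions, not part of any test mentioned in this thread.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalCheckpointConfigSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000);

        // RocksDB backend with incremental checkpoints (second constructor argument).
        // The HDFS URI is an assumption; any DFS path reachable by the cluster works.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));

        // Externalised checkpoints: retain checkpoint metadata so the job can be
        // resumed from it after a cancellation or a failure.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // Placeholder pipeline; a real test job would use keyed state so that
        // RocksDB actually accumulates data to checkpoint incrementally.
        env.generateSequence(0, Long.MAX_VALUE)
                .map(new MapFunction<Long, Long>() {
                    @Override
                    public Long map(Long value) {
                        return value;
                    }
                })
                .print();

        env.execute("incremental + externalised checkpoint test job");
    }
}

Per the descriptions above, FLINK-6964 is what breaks recovery from such incremental checkpoints, and FLINK-7041 is what prevents a backend configured in user code like this from being deserialised with the user classloader.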