Re: [DISCUSS] Release testing procedures, Flink 1.3.2

Piotr Nowojski Thu, 20 Jul 2017 00:10:18 -0700

Hi,

Regarding Kafka at-least-once bug. I could try to play with Flink 1.3.1 on a 
real cluster to provoke this bug, by basically repeating 
KafkaProducerTestBase#testOneToOneAtLeastOnce on a larger scale.


Piotrek

> On Jul 19, 2017, at 5:26 PM, Aljoscha Krettek <aljos...@apache.org> wrote:
> 
> Hi,
> 
> Yes! In my opinion, the most critical issues are these:
> 
> - https://issues.apache.org/jira/browse/FLINK-6964: 
> <https://issues.apache.org/jira/browse/FLINK-6964:> Fix recovery for 
> incremental checkpoints in StandaloneCompletedCheckpointStore
> - https://issues.apache.org/jira/browse/FLINK-7041: 
> <https://issues.apache.org/jira/browse/FLINK-7041:> Deserialize StateBackend 
> from JobCheckpointingSettings with user classloader
> 
> The first one makes incremental checkpoints on RocksDB unusable with 
> externalised checkpoints. The latter means that you cannot have custom 
> configuration of the RocksDB backend.
> 
> - https://issues.apache.org/jira/browse/FLINK-7216: 
> <https://issues.apache.org/jira/browse/FLINK-7216:> ExecutionGraph can 
> perform concurrent global restarts to scheduling
> - https://issues.apache.org/jira/browse/FLINK-7153: 
> <https://issues.apache.org/jira/browse/FLINK-7153:> Eager Scheduling can't 
> allocate source for ExecutionGraph correctly
> 
> These are critical scheduler bugs, Stephan can probably say more about them 
> than I can.
> 
> - https://issues.apache.org/jira/browse/FLINK-7143: 
> <https://issues.apache.org/jira/browse/FLINK-7143:> Partition assignment for 
> Kafka consumer is not stable
> - https://issues.apache.org/jira/browse/FLINK-7195: 
> <https://issues.apache.org/jira/browse/FLINK-7195:> FlinkKafkaConsumer should 
> not respect fetched partitions to filter restored partition states
> - https://issues.apache.org/jira/browse/FLINK-6996: 
> <https://issues.apache.org/jira/browse/FLINK-6996:> FlinkKafkaProducer010 
> doesn't guarantee at-least-once semantic
> 
> The first one means that you can have duplicate data because several 
> consumers would be consuming from one partition, without noticing it. The 
> second one causes partitions to be dropped from state if a broker is 
> temporarily not reachable.
> 
> The first two issues would have been caught by my proposed testing 
> procedures, the third and fourth might be caught but are very tricky to 
> provoke. I’m currently experimenting with this testing procedure to Flink 
> 1.3.1 to see if I can provoke it.
> 
> The Kafka bugs are super hard to provoke because they only occur if Kafka has 
> some temporary problems or there are communication problems.
> 
> I forgot to mention that I have actually two goals with this: 1) thoroughly 
> test Flink and 2) build expertise in the community, i.e. we’re forced to try 
> cluster environments/distributions that we are not familiar with and we 
> actually deploy a full job and play around with it.
> 
> Best,
> Aljoscha
> 
> 
>> On 19. Jul 2017, at 15:49, Shaoxuan Wang <shaox...@apache.org> wrote:
>> 
>> Hi Aljoscha,
>> Glad to see that we have a more thorough testing procedure. Could you
>> please share us what (the critical issues you mentioned) have been broken
>> in 1.3.0 & 1.3.1, and how the new proposed "functional testing section and
>> a combination of systems/configurations" can cover this. This will help us
>> to improve our production verification as well.
>> 
>> Regards,
>> Shaoxuan
>> 
>> 
>> On Wed, Jul 19, 2017 at 9:11 PM, Aljoscha Krettek <aljos...@apache.org>
>> wrote:
>> 
>>> Hi Everyone,
>>> 
>>> We are on the verge of starting the release process for Flink 1.3.2 and
>>> there have been some critical issues in both Flink 1.3.0 and 1.3.1. For
>>> Flink 1.3.2 I want to make very sure that we test as much as possible. For
>>> this I’m proposing a slightly changed testing procedure [1]. This is
>>> similar to the testing document we used for previous releases but has a new
>>> functional testing section that tries to outline a testing procedure and a
>>> combination of systems/configurations that we have to test. Please have a
>>> look and comment on whether you think this is sufficient (or a bit too
>>> much).
>>> 
>>> What do you think?
>>> 
>>> Best,
>>> Aljoscha
>>> 
>>> [1] https://docs.google.com/document/d/16fU1cpxoYf3o9cCDyakj7ZDnUoJTj
>>> 4_CEmMTpCkY81s/edit?usp=sharing
>

Re: [DISCUSS] Release testing procedures, Flink 1.3.2

Reply via email to