Hi,

Yes! In my opinion, the most critical issues are these:

 - https://issues.apache.org/jira/browse/FLINK-6964: Fix recovery for incremental checkpoints in StandaloneCompletedCheckpointStore
 - https://issues.apache.org/jira/browse/FLINK-7041: Deserialize StateBackend from JobCheckpointingSettings with user classloader

The first makes incremental checkpoints on RocksDB unusable with 
externalised checkpoints. The second means that you cannot use a custom 
configuration of the RocksDB backend.

 - https://issues.apache.org/jira/browse/FLINK-7216: ExecutionGraph can perform concurrent global restarts to scheduling
 - https://issues.apache.org/jira/browse/FLINK-7153: Eager Scheduling can't allocate source for ExecutionGraph correctly

These are critical scheduler bugs; Stephan can probably say more about them 
than I can.

 - https://issues.apache.org/jira/browse/FLINK-7143: Partition assignment for Kafka consumer is not stable
 - https://issues.apache.org/jira/browse/FLINK-7195: FlinkKafkaConsumer should not respect fetched partitions to filter restored partition states
 - https://issues.apache.org/jira/browse/FLINK-6996: FlinkKafkaProducer010 doesn't guarantee at-least-once semantic

The first one means that you can get duplicate data because several consumers 
may consume from the same partition without noticing it. The second one 
causes partitions to be dropped from state if a broker is temporarily 
unreachable.
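To illustrate the first issue: partition assignment has to be deterministic, so that every consumer subtask computes the same partition-to-subtask mapping on every restart, regardless of the order in which the broker happens to return partitions. A minimal sketch of such a stable scheme (hypothetical names, not Flink's actual implementation):

```python
def assign_partitions(partitions, num_subtasks, subtask_index):
    """Deterministically map partitions to one consumer subtask.

    Sorting first makes the mapping independent of the (unstable)
    order in which partitions are discovered from the broker, so no
    two subtasks can ever claim the same partition.
    """
    stable = sorted(partitions)
    return [p for i, p in enumerate(stable) if i % num_subtasks == subtask_index]

# Two subtasks split three partitions without overlap, even though each
# subtask sees the partition list in a different order.
a = assign_partitions(["topic-2", "topic-0", "topic-1"], 2, 0)
b = assign_partitions(["topic-0", "topic-1", "topic-2"], 2, 1)
```

If assignment instead depended on the broker-returned order, two subtasks could disagree on the mapping and end up reading the same partition, which is exactly the duplicate-data symptom described above.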

The first two issues would have been caught by my proposed testing procedure; 
the third and fourth might be caught but are very tricky to provoke. I’m 
currently applying this testing procedure to Flink 1.3.1 to see if I can 
provoke them.

The Kafka bugs are very hard to provoke because they only occur when Kafka 
has temporary problems or there are communication issues.
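For context on the producer issue: at-least-once delivery requires that every record still buffered on the client is flushed to the broker before a checkpoint is confirmed; otherwise a failure after the checkpoint silently drops the buffered records. A toy sketch of that invariant (hypothetical names, not the actual FlinkKafkaProducer010 code):

```python
class BufferingProducer:
    """Toy producer that buffers records client-side before sending."""

    def __init__(self):
        self.pending = []    # buffered on the client, not yet on the broker
        self.delivered = []  # acknowledged by the broker

    def send(self, record):
        self.pending.append(record)

    def flush(self):
        # Push every buffered record to the broker.
        self.delivered.extend(self.pending)
        self.pending.clear()

    def snapshot_state(self):
        # At-least-once: a checkpoint may only complete once nothing
        # remains in the client-side buffer.
        self.flush()
        assert not self.pending

producer = BufferingProducer()
producer.send("a")
producer.send("b")
producer.snapshot_state()  # all buffered records reach the broker
```

A producer that skips the flush in `snapshot_state` can acknowledge a checkpoint while records sit only in memory, which is the at-least-once violation the third issue describes.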

I forgot to mention that I actually have two goals with this: 1) thoroughly 
test Flink, and 2) build expertise in the community, i.e. we’re forced to try 
cluster environments/distributions that we are not familiar with, and we 
actually deploy a full job and play around with it.

Best,
Aljoscha


> On 19. Jul 2017, at 15:49, Shaoxuan Wang <shaox...@apache.org> wrote:
> 
> Hi Aljoscha,
> Glad to see that we have a more thorough testing procedure. Could you
> please share us what (the critical issues you mentioned) have been broken
> in 1.3.0 & 1.3.1, and how the new proposed "functional testing section and
> a combination of systems/configurations" can cover this. This will help us
> to improve our production verification as well.
> 
> Regards,
> Shaoxuan
> 
> 
> On Wed, Jul 19, 2017 at 9:11 PM, Aljoscha Krettek <aljos...@apache.org>
> wrote:
> 
>> Hi Everyone,
>> 
>> We are on the verge of starting the release process for Flink 1.3.2 and
>> there have been some critical issues in both Flink 1.3.0 and 1.3.1. For
>> Flink 1.3.2 I want to make very sure that we test as much as possible. For
>> this I’m proposing a slightly changed testing procedure [1]. This is
>> similar to the testing document we used for previous releases but has a new
>> functional testing section that tries to outline a testing procedure and a
>> combination of systems/configurations that we have to test. Please have a
>> look and comment on whether you think this is sufficient (or a bit too
>> much).
>> 
>> What do you think?
>> 
>> Best,
>> Aljoscha
>> 
>> [1] https://docs.google.com/document/d/16fU1cpxoYf3o9cCDyakj7ZDnUoJTj
>> 4_CEmMTpCkY81s/edit?usp=sharing
