By setting

    properties.setProperty("batch.size", "10240000");
    properties.setProperty("linger.ms", "10000");

in the properties passed to FlinkKafkaProducer010 (to postpone automatic flushing), and then killing (kill -9 PID) the YarnTaskManager process in the middle of executing a Flink job. Records that had not yet been flushed automatically were thus lost (this at-least-once bug was about the missing manual/explicit flush on a checkpoint).

Piotrek
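For reference, a repro job along those lines could look roughly like the sketch below. This is only an illustration: the broker address, topic name, and dummy source are assumptions, not taken from this thread; the classes are the ones shipped with the Flink 1.3.x Kafka 0.10 connector.

import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class AtLeastOnceRepro {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5000); // checkpoint every 5 seconds

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "broker-1:9092"); // assumed broker address
        // A huge batch size and a long linger delay postpone the Kafka client's
        // automatic flushing, so records sit in the client-side buffers longer.
        properties.setProperty("batch.size", "10240000");
        properties.setProperty("linger.ms", "10000");

        // Dummy source that emits records continuously (illustrative only).
        DataStream<String> records = env.addSource(new SourceFunction<String>() {
            private volatile boolean running = true;

            @Override
            public void run(SourceFunction.SourceContext<String> ctx) throws Exception {
                long i = 0;
                while (running) {
                    ctx.collect("record-" + i++);
                    Thread.sleep(1);
                }
            }

            @Override
            public void cancel() {
                running = false;
            }
        });

        records.addSink(new FlinkKafkaProducer010<String>(
                "test-topic", new SimpleStringSchema(), properties));

        env.execute("at-least-once repro");
    }
}

While such a job runs, killing the YarnTaskManager with kill -9 between two checkpoints loses whatever is still sitting in the Kafka client buffers, even though a checkpoint covering those records may already have completed.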
> On Jul 26, 2017, at 1:19 PM, Aljoscha Krettek <aljos...@apache.org> wrote:
>
> Sweet (maybe?)! How did you reproduce data-loss?
>
> Best,
> Aljoscha
>
>> On 26. Jul 2017, at 11:13, Piotr Nowojski <pi...@data-artisans.com> wrote:
>>
>> It took me longer than I expected, but I was able to reproduce data loss
>> with older Flink versions while running Flink in a 3-node cluster. I have
>> also validated that the at-least-once semantic is fixed for Kafka 0.10 in
>> Flink 1.3-SNAPSHOT.
>>
>> Piotrek
>>
>>> On Jul 20, 2017, at 4:52 PM, Stephan Ewen <se...@apache.org> wrote:
>>>
>>> Thank you very much for driving this!
>>>
>>> On Thu, Jul 20, 2017 at 9:09 AM, Piotr Nowojski <pi...@data-artisans.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Regarding the Kafka at-least-once bug: I could try to play with Flink
>>>> 1.3.1 on a real cluster to provoke this bug, by basically repeating
>>>> KafkaProducerTestBase#testOneToOneAtLeastOnce on a larger scale.
>>>>
>>>> Piotrek
>>>>
>>>>> On Jul 19, 2017, at 5:26 PM, Aljoscha Krettek <aljos...@apache.org> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Yes! In my opinion, the most critical issues are these:
>>>>>
>>>>> - https://issues.apache.org/jira/browse/FLINK-6964: Fix recovery for
>>>>>   incremental checkpoints in StandaloneCompletedCheckpointStore
>>>>> - https://issues.apache.org/jira/browse/FLINK-7041: Deserialize
>>>>>   StateBackend from JobCheckpointingSettings with user classloader
>>>>>
>>>>> The first one makes incremental checkpoints on RocksDB unusable with
>>>>> externalised checkpoints. The latter means that you cannot have custom
>>>>> configuration of the RocksDB backend.
>>>>>
>>>>> - https://issues.apache.org/jira/browse/FLINK-7216: ExecutionGraph can
>>>>>   perform concurrent global restarts to scheduling
>>>>> - https://issues.apache.org/jira/browse/FLINK-7153: Eager Scheduling
>>>>>   can't allocate source for ExecutionGraph correctly
>>>>>
>>>>> These are critical scheduler bugs; Stephan can probably say more about
>>>>> them than I can.
>>>>>
>>>>> - https://issues.apache.org/jira/browse/FLINK-7143: Partition assignment
>>>>>   for Kafka consumer is not stable
>>>>> - https://issues.apache.org/jira/browse/FLINK-7195: FlinkKafkaConsumer
>>>>>   should not respect fetched partitions to filter restored partition
>>>>>   states
>>>>> - https://issues.apache.org/jira/browse/FLINK-6996: FlinkKafkaProducer010
>>>>>   doesn't guarantee at-least-once semantic
>>>>>
>>>>> The first one means that you can have duplicate data because several
>>>>> consumers would be consuming from one partition without noticing it. The
>>>>> second one causes partitions to be dropped from state if a broker is
>>>>> temporarily not reachable.
>>>>>
>>>>> The first two issues would have been caught by my proposed testing
>>>>> procedures; the third and fourth might be caught but are very tricky to
>>>>> provoke. I'm currently applying this testing procedure to Flink 1.3.1 to
>>>>> see if I can provoke them.
>>>>>
>>>>> The Kafka bugs are super hard to provoke because they only occur if
>>>>> Kafka has some temporary problems or there are communication problems.
>>>>>
>>>>> I forgot to mention that I actually have two goals with this: 1)
>>>>> thoroughly test Flink and 2) build expertise in the community, i.e.
>>>>> we're forced to try cluster environments/distributions that we are not
>>>>> familiar with and we actually deploy a full job and play around with it.
>>>>>
>>>>> Best,
>>>>> Aljoscha
>>>>>
>>>>>> On 19. Jul 2017, at 15:49, Shaoxuan Wang <shaox...@apache.org> wrote:
>>>>>>
>>>>>> Hi Aljoscha,
>>>>>> Glad to see that we have a more thorough testing procedure. Could you
>>>>>> please share with us what (the critical issues you mentioned) has been
>>>>>> broken in 1.3.0 & 1.3.1, and how the newly proposed "functional testing
>>>>>> section and a combination of systems/configurations" can cover this.
>>>>>> This will help us to improve our production verification as well.
>>>>>>
>>>>>> Regards,
>>>>>> Shaoxuan
>>>>>>
>>>>>> On Wed, Jul 19, 2017 at 9:11 PM, Aljoscha Krettek <aljos...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Everyone,
>>>>>>>
>>>>>>> We are on the verge of starting the release process for Flink 1.3.2,
>>>>>>> and there have been some critical issues in both Flink 1.3.0 and
>>>>>>> 1.3.1. For Flink 1.3.2 I want to make very sure that we test as much
>>>>>>> as possible. For this I'm proposing a slightly changed testing
>>>>>>> procedure [1]. This is similar to the testing document we used for
>>>>>>> previous releases but has a new functional testing section that tries
>>>>>>> to outline a testing procedure and a combination of
>>>>>>> systems/configurations that we have to test. Please have a look and
>>>>>>> comment on whether you think this is sufficient (or a bit too much).
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>> Best,
>>>>>>> Aljoscha
>>>>>>>
>>>>>>> [1] https://docs.google.com/document/d/16fU1cpxoYf3o9cCDyakj7ZDnUoJTj4_CEmMTpCkY81s/edit?usp=sharing
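For context on FLINK-6996 discussed above (FlinkKafkaProducer010 not guaranteeing at-least-once), the missing piece was an explicit flush when a checkpoint is taken. Below is a minimal, hand-rolled illustration of that flush-on-checkpoint pattern; it is not the actual FlinkKafkaProducer010 code, the class name is made up, and producerConfig is assumed to carry bootstrap.servers plus key/value serializer settings.

import java.util.Properties;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FlushingKafkaSink extends RichSinkFunction<String> implements CheckpointedFunction {

    private final String topic;
    private final Properties producerConfig; // must include bootstrap.servers and key/value serializers

    private transient KafkaProducer<String, String> producer;

    public FlushingKafkaSink(String topic, Properties producerConfig) {
        this.topic = topic;
        this.producerConfig = producerConfig;
    }

    @Override
    public void open(Configuration parameters) {
        producer = new KafkaProducer<>(producerConfig);
    }

    @Override
    public void invoke(String value) {
        // Asynchronous send: the record may still sit in the client-side buffers
        // (see batch.size / linger.ms above) after this call returns.
        producer.send(new ProducerRecord<>(topic, value));
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) {
        // The at-least-once guarantee hinges on this explicit flush: the
        // checkpoint must not complete while records are still buffered.
        producer.flush();
    }

    @Override
    public void initializeState(FunctionInitializationContext context) {
        // No state to restore; the flush-on-checkpoint hook is all we need here.
    }

    @Override
    public void close() {
        if (producer != null) {
            producer.close();
        }
    }
}

Without the flush() in snapshotState(), a checkpoint can complete while records that logically precede it exist only in the client buffers; a kill -9 of the TaskManager in that window (as in the repro above) then loses them silently.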