Sweet (maybe?)! How did you reproduce the data loss?

Best,
Aljoscha

> On 26. Jul 2017, at 11:13, Piotr Nowojski <pi...@data-artisans.com> wrote:
> 
> It took me longer than I expected, but I was able to reproduce data loss 
> with older Flink versions while running Flink in a 3-node cluster. I have 
> also validated that the at-least-once semantic is fixed for Kafka 0.10 in 
> Flink 1.3-SNAPSHOT.
> 
> Piotrek
> 
>> On Jul 20, 2017, at 4:52 PM, Stephan Ewen <se...@apache.org> wrote:
>> 
>> Thank you very much for driving this!
>> 
>> On Thu, Jul 20, 2017 at 9:09 AM, Piotr Nowojski <pi...@data-artisans.com>
>> wrote:
>> 
>>> Hi,
>>> 
>>> Regarding the Kafka at-least-once bug: I could try to provoke it with
>>> Flink 1.3.1 on a real cluster, by basically repeating
>>> KafkaProducerTestBase#testOneToOneAtLeastOnce on a larger scale.
>>> 
>>> Piotrek
>>> 
>>>> On Jul 19, 2017, at 5:26 PM, Aljoscha Krettek <aljos...@apache.org>
>>> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> Yes! In my opinion, the most critical issues are these:
>>>> 
>>>> - https://issues.apache.org/jira/browse/FLINK-6964: Fix recovery for incremental checkpoints in StandaloneCompletedCheckpointStore
>>>> - https://issues.apache.org/jira/browse/FLINK-7041: Deserialize StateBackend from JobCheckpointingSettings with user classloader
>>>> 
>>>> The first one makes incremental checkpoints on RocksDB unusable with
>>> externalised checkpoints. The second means that you cannot have custom
>>> configuration of the RocksDB backend.
>>>> 
>>>> - https://issues.apache.org/jira/browse/FLINK-7216: ExecutionGraph can perform concurrent global restarts to scheduling
>>>> - https://issues.apache.org/jira/browse/FLINK-7153: Eager Scheduling can't allocate source for ExecutionGraph correctly
>>>> 
>>>> These are critical scheduler bugs, Stephan can probably say more about
>>> them than I can.
>>>> 
>>>> - https://issues.apache.org/jira/browse/FLINK-7143: Partition assignment for Kafka consumer is not stable
>>>> - https://issues.apache.org/jira/browse/FLINK-7195: FlinkKafkaConsumer should not respect fetched partitions to filter restored partition states
>>>> - https://issues.apache.org/jira/browse/FLINK-6996: FlinkKafkaProducer010 doesn't guarantee at-least-once semantic
>>>> 
>>>> The first one means that you can have duplicate data because several
>>> consumers would be consuming from one partition, without noticing it. The
>>> second one causes partitions to be dropped from state if a broker is
>>> temporarily not reachable.
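
[Editor's note: the partition-assignment instability behind FLINK-7143 comes down to the partition-to-subtask mapping depending on the order in which partition metadata is discovered. A minimal, purely illustrative sketch; the class and method names are made up, and this is not Flink's actual assignment code:]

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionAssignment {

    /** Deterministic mapping: depends only on the partition id and the parallelism. */
    static int stableAssign(int partitionId, int numSubtasks) {
        return partitionId % numSubtasks;
    }

    /** Assign every partition from one metadata listing. */
    static Map<Integer, Integer> assignAll(int[] discoveredPartitions, int numSubtasks) {
        Map<Integer, Integer> assignment = new HashMap<>();
        for (int p : discoveredPartitions) {
            assignment.put(p, stableAssign(p, numSubtasks));
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Two metadata fetches that list the same partitions in different orders.
        int[] fetchA = {0, 1, 2, 3};
        int[] fetchB = {3, 2, 1, 0};
        Map<Integer, Integer> a = assignAll(fetchA, 2);
        Map<Integer, Integer> b = assignAll(fetchB, 2);
        // A stable assignment yields the same mapping either way, so no
        // partition can end up owned by two subtasks after a restore.
        System.out.println(a.equals(b)); // prints "true"
    }
}
```

[If the mapping instead depended on discovery order, e.g. "subtask i gets the i-th partition in the listing", the two fetches above would produce different owners for the same partition, which is exactly the duplicate-consumption scenario described here.]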
>>>> 
>>>> The first two issues would have been caught by my proposed testing
>>> procedures; the third and fourth might be caught but are very tricky to
>>> provoke. I’m currently applying this testing procedure to Flink
>>> 1.3.1 to see if I can provoke them.
>>>> 
>>>> The Kafka bugs are super hard to provoke because they only occur if
>>> Kafka has some temporary problems or there are communication problems.
>>>> 
>>>> I forgot to mention that I actually have two goals with this: 1)
>>> thoroughly testing Flink and 2) building expertise in the community, i.e.
>>> we’re forced to try cluster environments/distributions that we are not
>>> familiar with, and we actually deploy a full job and play around with it.
>>>> 
>>>> Best,
>>>> Aljoscha
>>>> 
>>>> 
>>>>> On 19. Jul 2017, at 15:49, Shaoxuan Wang <shaox...@apache.org> wrote:
>>>>> 
>>>>> Hi Aljoscha,
>>>>> Glad to see that we have a more thorough testing procedure. Could you
>>>>> please share with us which critical issues were broken in 1.3.0 & 1.3.1,
>>>>> and how the newly proposed “functional testing section and a combination
>>>>> of systems/configurations” can cover them? This will help us improve our
>>>>> production verification as well.
>>>>> 
>>>>> Regards,
>>>>> Shaoxuan
>>>>> 
>>>>> 
>>>>> On Wed, Jul 19, 2017 at 9:11 PM, Aljoscha Krettek <aljos...@apache.org>
>>>>> wrote:
>>>>> 
>>>>>> Hi Everyone,
>>>>>> 
>>>>>> We are on the verge of starting the release process for Flink 1.3.2 and
>>>>>> there have been some critical issues in both Flink 1.3.0 and 1.3.1. For
>>>>>> Flink 1.3.2 I want to make very sure that we test as much as possible.
>>> For
>>>>>> this I’m proposing a slightly changed testing procedure [1]. This is
>>>>>> similar to the testing document we used for previous releases but has
>>> a new
>>>>>> functional testing section that tries to outline a testing procedure
>>> and a
>>>>>> combination of systems/configurations that we have to test. Please
>>> have a
>>>>>> look and comment on whether you think this is sufficient (or a bit too
>>>>>> much).
>>>>>> 
>>>>>> What do you think?
>>>>>> 
>>>>>> Best,
>>>>>> Aljoscha
>>>>>> 
>>>>>> [1] https://docs.google.com/document/d/16fU1cpxoYf3o9cCDyakj7ZDnUoJTj
>>>>>> 4_CEmMTpCkY81s/edit?usp=sharing
>>>> 
>>> 
>>> 
> 
