Hey Martijn,

Thanks a lot for getting back to us. To give you a bit more context, we
maintain an open-source project around Flink, Dagger
<http://github.com/odpf/dagger>, which is a wrapper for proto processing. As
part of the upgrade to the latest version, we did some refactoring and moved
to KafkaSource, since the older FlinkKafkaConsumer is deprecated.
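
For concreteness, the source refactor is along these lines (a minimal sketch
with placeholder topic/broker/group values and a print() stand-in sink, not
the actual Dagger code):

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.util.Properties;

public class KafkaSourceMigrationSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Before (1.9): FlinkKafkaConsumer, deprecated as of 1.14.
        // Shown only for comparison; not wired into the job below.
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker:9092");
        props.setProperty("group.id", "dagger-consumer");
        FlinkKafkaConsumer<String> oldConsumer =
            new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props);

        // After (1.14): KafkaSource, wired in through env.fromSource().
        KafkaSource<String> source = KafkaSource.<String>builder()
            .setBootstrapServers("broker:9092")
            .setTopics("input-topic")
            .setGroupId("dagger-consumer")
            .setStartingOffsets(OffsetsInitializer.latest())
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();
        DataStream<String> stream =
            env.fromSource(source, WatermarkStrategy.noWatermarks(), "data_streams_0");

        stream.print();  // the downstream filter/select logic stays the same in the real jobs
        env.execute("kafka-source-migration-sketch");
    }
}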

So we currently do not have any setup to test that hypothesis. Also, just
increasing the resources a bit fixes it, and it only happens for a small set
of jobs during high traffic.

We would love to get some input from the community, as this might cause
errors in some of our production jobs.

Thanks and regards,
//arujit

On Tue, Feb 15, 2022 at 8:48 PM Martijn Visser <mart...@ververica.com>
wrote:

> Hi Arujit,
>
> I'm also looping some contributors from the connector and runtime
> perspective into this thread. Did you also test the upgrade first by only
> upgrading to Flink 1.14 and keeping the FlinkKafkaConsumer? That would
> offer a better way to determine whether the regression is caused by the
> Flink upgrade or by the change of connector.
>
> Best regards,
>
> Martijn Visser
> https://twitter.com/MartijnVisser82
>
>
> On Tue, 15 Feb 2022 at 13:07, Arujit Pradhan <arujit.prad...@gojek.com>
> wrote:
>
>> Hey team,
>>
>> We are migrating our Flink codebase and a bunch of jobs from Flink 1.9 to
>> Flink 1.14. To check that performance stays consistent, we ran a set of
>> jobs for a week on both 1.9 and 1.14 simultaneously, with the same
>> resources and configurations, and monitored them.
>>
>> Though most of the jobs are running fine, we see significant performance
>> degradation in some of the high-throughput jobs during peak hours. As a
>> result, we see high consumer lag and data drops while processing messages
>> from Kafka in some of the 1.14 jobs, while the same jobs on 1.9 work just
>> fine. We are now debugging and trying to understand the potential reason.
>>
>> One hypothesis we can think of is the change in the sequence of processing
>> in the source operator. To explain this, we are adding screenshots of the
>> problematic tasks below; the first is for 1.14 and the second for 1.9. On
>> inspection, the processing sequence in 1.14 is:
>>
>> data_streams_0 -> Timestamps/Watermarks -> Filter -> Select.
>>
>> While in 1.9 it was:
>>
>> data_streams_0 -> Filter -> Timestamps/Watermarks -> Select.
>>
>> In 1.14 we are using the KafkaSource API, while in the older version it
>> was the FlinkKafkaConsumer API. We wanted to understand whether this could
>> cause a performance decline, given that all other configurations/resources
>> for both jobs are identical, and if so, how to avoid it. Also, we do not
>> see any unusual CPU/memory behaviour while monitoring the affected jobs.
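>>
>> Roughly, in DataStream terms, the two graphs correspond to something like
>> the following (a simplified sketch written against the 1.14 APIs, not the
>> actual Dagger code; topic, broker, filter and "Select" functions are
>> placeholders, and imports/surrounding boilerplate are omitted):
>>
>> StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>> WatermarkStrategy<String> strategy =
>>     WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(10));
>>
>> // 1.9-style pipeline (FlinkKafkaConsumer): watermarks assigned after the filter
>> Properties props = new Properties();
>> props.setProperty("bootstrap.servers", "broker:9092");
>> props.setProperty("group.id", "group");
>> env.addSource(new FlinkKafkaConsumer<>("topic", new SimpleStringSchema(), props)) // data_streams_0
>>     .filter(s -> !s.isEmpty())                  // Filter
>>     .assignTimestampsAndWatermarks(strategy)    // Timestamps/Watermarks
>>     .map(String::toUpperCase)                   // Select (placeholder)
>>     .print();
>>
>> // 1.14-style pipeline (KafkaSource): watermarks assigned right after the
>> // source, before the filter
>> KafkaSource<String> source = KafkaSource.<String>builder()
>>     .setBootstrapServers("broker:9092")
>>     .setTopics("topic")
>>     .setGroupId("group")
>>     .setValueOnlyDeserializer(new SimpleStringSchema())
>>     .build();
>> env.fromSource(source, WatermarkStrategy.noWatermarks(), "data_streams_0")
>>     .assignTimestampsAndWatermarks(strategy)    // Timestamps/Watermarks
>>     .filter(s -> !s.isEmpty())                  // Filter
>>     .map(String::toUpperCase)                   // Select (placeholder)
>>     .print();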
>>
>> Source Operator in 1.14: [screenshot]
>> Source Operator in 1.9: [screenshot]
>> Thanks in advance,
>> //arujit