[ 
https://issues.apache.org/jira/browse/FLINK-29365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Paul updated FLINK-29365:
--------------------------------
    Affects Version/s: 1.15.3
                           (was: 1.15.2)

> Millisecond behind latest jumps after Flink 1.15.2 upgrade
> ----------------------------------------------------------
>
>                 Key: FLINK-29365
>                 URL: https://issues.apache.org/jira/browse/FLINK-29365
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / Kinesis
>    Affects Versions: 1.15.3
>         Environment: Redeployment from 1.14.4 to 1.15.2
>            Reporter: Wilson Wu
>            Priority: Major
>         Attachments: 2022-09-14T17_00_00.000Z - 
> 2022-09-14T18_03_44.089Z_14.4-15.2.numbers, 2022-09-15T12_49_54.686Z - 
> 2022-09-15T14_57_44.089Z_15.2-15.2.numbers, Screen Shot 2022-09-19 at 2.50.56 
> PM.png
>
>
> (This is my first time filing a ticket in the Flink community; please let me 
> know if there are any guidelines I need to follow.)
> I noticed very strange behavior after a recent version bump from Flink 
> 1.14.4 to 1.15.2. My project consumes around 30K records per second from a 
> sharded Kinesis stream. For a version upgrade, we follow the best practice 
> of triggering a savepoint from the running job, starting the new job from 
> that savepoint, and then removing the old job. This procedure has been 
> tested multiple times on 1.14.4 without any issue. Usually, after a version 
> upgrade, our job shows a delay of a few minutes in millisecond behind 
> latest, but it catches up quickly (within 30 minutes). Our savepoint is 
> around one hundred MB, and our job DAG becomes 90-100% busy with some 
> backpressure when we redeploy, but after 10-20 minutes it goes back to 
> normal.
> Then the strange thing happened: when I redeployed with the 1.15.2 upgrade 
> from a running 1.14.4 job, a savepoint was created and the new job started 
> running. All the metrics look fine, except that [millisecond behind the 
> latest|https://flink.apache.org/news/2019/02/25/monitoring-best-practices.html]
>  suddenly jumps to 10 hours, and it takes days for my application to catch 
> up with the latest record in the Kinesis stream. I don't understand why it 
> jumps from 0 seconds to 10+ hours when we start the new job. The only 
> notable change I introduced with the version bump is changing 
> [failOnError|https://nightlies.apache.org/flink/flink-docs-release-1.15/api/java/org/apache/flink/connector/kinesis/sink/KinesisStreamsSink.html]
>  from true to false, but I don't think this is the root cause.
> I also tried redeploying the new 1.15.2 job with a changed parallelism. 
> Redeploying a job from 1.15.2 does not introduce a big delay, so I assume 
> the issue above only happens when we bump the version from 1.14.4 to 1.15.2 
> (see the attached screenshot). I tried the version bump twice and saw the 
> same 10+ hour jump in delay both times. We do not have any changes related 
> to time zones.
> Please let me know if this can be filed as a bug, as I do not have a running 
> project with the full Kinesis setup available that can reproduce the issue.
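
As a rough illustration of why a 10-hour lag can take days to clear at this volume, the arithmetic can be sketched as follows. The 30K records/s ingest rate and the 10-hour lag come from the report; the processing rate of 36K records/s is purely a hypothetical assumption, as is the simplification that ingestion stays constant. The function name is also illustrative.

```python
def catch_up_hours(lag_hours, ingest_rate, process_rate):
    """Hours needed to drain a backlog while new records keep arriving.

    Assumes constant ingest and processing rates (records per second).
    """
    if process_rate <= ingest_rate:
        raise ValueError("the job never catches up unless it outpaces ingestion")
    backlog_records = lag_hours * 3600 * ingest_rate
    drain_rate = process_rate - ingest_rate  # net records/s removed from the backlog
    return backlog_records / drain_rate / 3600


INGEST_RATE = 30_000           # records/s, from the report
LAG_HOURS = 10                 # reported jump in "millisecond behind latest"
ASSUMED_PROCESS_RATE = 36_000  # records/s, hypothetical: 20% headroom over ingest

hours = catch_up_hours(LAG_HOURS, INGEST_RATE, ASSUMED_PROCESS_RATE)
print(f"backlog: {LAG_HOURS * 3600 * INGEST_RATE:,} records")
print(f"catch-up: {hours:.0f} hours")
```

Under these assumed numbers the backlog is about 1.08 billion records and takes roughly 50 hours to drain, which is in line with the "days" described above. A job with less headroom would take proportionally longer.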



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
