[ https://issues.apache.org/jira/browse/FLINK-29365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Fabian Paul updated FLINK-29365:
--------------------------------
    Affects Version/s: 1.15.3
                           (was: 1.15.2)

> Millisecond behind latest jumps after Flink 1.15.2 upgrade
> ----------------------------------------------------------
>
>                 Key: FLINK-29365
>                 URL: https://issues.apache.org/jira/browse/FLINK-29365
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / Kinesis
>    Affects Versions: 1.15.3
>         Environment: Redeployment from 1.14.4 to 1.15.2
>            Reporter: Wilson Wu
>            Priority: Major
>         Attachments: 2022-09-14T17_00_00.000Z - 2022-09-14T18_03_44.089Z_14.4-15.2.numbers,
>                      2022-09-15T12_49_54.686Z - 2022-09-15T14_57_44.089Z_15.2-15.2.numbers,
>                      Screen Shot 2022-09-19 at 2.50.56 PM.png
>
>
> (First time filing a ticket in the Flink community, please let me know if
> there are any guidelines I need to follow.)
> I noticed very strange behavior after a recent version bump from Flink
> 1.14.4 to 1.15.2. My project consumes around 30K records per second from a
> sharded Kinesis stream. For a version upgrade, it follows the usual best
> practice: trigger a savepoint from the running job, start the new job from
> that savepoint, and then remove the old job. This procedure has been tested
> multiple times on 1.14.4 without any issue.
> Usually, after a version upgrade, our job shows a few minutes of delay in
> millisecond behind latest, but it catches up quickly (within 30 minutes).
> Our savepoint is about 100 MB, and our job DAG becomes 90 - 100% busy with
> some backpressure when we redeploy, but after 10-20 minutes it goes back to
> normal.
> The strange part: when I redeployed with the 1.15.2 upgrade from a running
> 1.14.4 job, I could see that a savepoint had been created and the new job
> was running, and all the metrics looked fine, except that [millisecond
> behind the
> latest|https://flink.apache.org/news/2019/02/25/monitoring-best-practices.html]
> suddenly jumped to 10 hours, and it takes days for my application to catch
> up with the latest record in the Kinesis stream. I don't understand why it
> jumps from 0 seconds to 10+ hours when we start the new job. The only
> significant change I introduced with the version bump was switching
> [failOnError|https://nightlies.apache.org/flink/flink-docs-release-1.15/api/java/org/apache/flink/connector/kinesis/sink/KinesisStreamsSink.html]
> from true to false, but I don't think this is the root cause.
> I also tried redeploying the new 1.15.2 job with a different parallelism.
> Redeploying from 1.15.2 to 1.15.2 does not introduce a big delay, so I
> assume the issue above only happens when we bump the version from 1.14.4 to
> 1.15.2 (see the attached screenshot). I did the version bump twice and saw
> the same 10+ hour jump in delay both times. We have no changes related to
> time zones.
> Please let me know if this can be filed as a bug, as I do not have a running
> project with the full Kinesis setup available to reproduce the issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)