[
https://issues.apache.org/jira/browse/FLINK-40011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Martijn Visser resolved FLINK-40011.
------------------------------------
Fix Version/s: 2.4.0
Resolution: Fixed
Fixed in apache/flink:master 063b46ef66843868bde7de9f9366ffff3e53055e
> CommonExecSinkITCase.testStreamRecordTimestampInserterSinkRuntimeProvider is
> flaky: test source can finish before NoMoreSplitsEvent is delivered
> ------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-40011
> URL: https://issues.apache.org/jira/browse/FLINK-40011
> Project: Flink
> Issue Type: Bug
> Components: Table SQL / Planner, Tests
> Affects Versions: 2.4.0
> Reporter: Martijn Visser
> Assignee: Martijn Visser
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.4.0
>
>
> testStreamRecordTimestampInserterSinkRuntimeProvider fails intermittently
> when the
> job fails instead of finishing.
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=76431&view=results
> (leg: test_ci table)
> {code}
> org.apache.flink.table.api.TableException: Failed to wait job finish
> at
> CommonExecSinkITCase.testStreamRecordTimestampInserterSinkRuntimeProvider(CommonExecSinkITCase.java:127)
> Caused by: JobExecutionException: Job execution failed.
> Caused by: org.apache.flink.runtime.JobException: Recovery is suppressed by
> NoRestartBackoffTimeStrategy
> Caused by: org.apache.flink.util.FlinkException: An OperatorEvent from an
> OperatorCoordinator to a task was lost. Triggering task failover to ensure
> consistency. Event: '[NoMoreSplitEvent]', targetTask: Source: T1[154] ->
> WatermarkAssigner[155] -> StreamRecordTimestampInserter[156] -> T1[156]:
> Writer (1/4) - execution #0
> Caused by:
> org.apache.flink.runtime.operators.coordination.TaskNotRunningException: Task
> is not running, but in state FINISHED
> {code}
> Root cause: the in-test TestSource reader returned END_OF_INPUT as soon as it
> had
> emitted its rows. The SingleSplitEnumerator assigns the split and then
> delivers a
> NoMoreSplitsEvent as a separate event via the coordinator thread. If the
> split-holding
> subtask (here, (1/4)) finishes before that event is delivered, the
> coordinator sends
> it to an already-FINISHED task; the framework treats the undeliverable event
> as a lost
> OperatorEvent and triggers a failover, which the job's
> NoRestartBackoffTimeStrategy
> turns into a hard failure. No data is lost (all rows are emitted) — this is a
> benign
> test-config race, not a coordinator bug.
> Fix: defer END_OF_INPUT until notifyNoMoreSplits() has been received, so the
> reader
> cannot finish before the event lands (mirroring
> SourceReaderBase.finishedOrAvailableLater()).
> First observed in build 76431; the fix is justified on the structural race,
> not frequency.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)