[ 
https://issues.apache.org/jira/browse/FLINK-40011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn Visser resolved FLINK-40011.
------------------------------------
    Fix Version/s: 2.4.0
       Resolution: Fixed

Fixed in apache/flink:master 063b46ef66843868bde7de9f9366ffff3e53055e

> CommonExecSinkITCase.testStreamRecordTimestampInserterSinkRuntimeProvider is 
> flaky: test source can finish before NoMoreSplitsEvent is delivered
> ------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-40011
>                 URL: https://issues.apache.org/jira/browse/FLINK-40011
>             Project: Flink
>          Issue Type: Bug
>          Components: Table SQL / Planner, Tests
>    Affects Versions: 2.4.0
>            Reporter: Martijn Visser
>            Assignee: Martijn Visser
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.4.0
>
>
> testStreamRecordTimestampInserterSinkRuntimeProvider fails intermittently 
> when the
> job fails instead of finishing.
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=76431&view=results
>  (leg: test_ci table)
> {code}
> org.apache.flink.table.api.TableException: Failed to wait job finish
>   at 
> CommonExecSinkITCase.testStreamRecordTimestampInserterSinkRuntimeProvider(CommonExecSinkITCase.java:127)
> Caused by: JobExecutionException: Job execution failed.
> Caused by: org.apache.flink.runtime.JobException: Recovery is suppressed by 
> NoRestartBackoffTimeStrategy
> Caused by: org.apache.flink.util.FlinkException: An OperatorEvent from an 
> OperatorCoordinator to a task was lost. Triggering task failover to ensure 
> consistency. Event: '[NoMoreSplitEvent]', targetTask: Source: T1[154] -> 
> WatermarkAssigner[155] -> StreamRecordTimestampInserter[156] -> T1[156]: 
> Writer (1/4) - execution #0
> Caused by: 
> org.apache.flink.runtime.operators.coordination.TaskNotRunningException: Task 
> is not running, but in state FINISHED
> {code}
> Root cause: the in-test TestSource reader returned END_OF_INPUT as soon as it 
> had
> emitted its rows. The SingleSplitEnumerator assigns the split and then 
> delivers a
> NoMoreSplitsEvent as a separate event via the coordinator thread. If the 
> split-holding
> subtask (here, (1/4)) finishes before that event is delivered, the 
> coordinator sends
> it to an already-FINISHED task; the framework treats the undeliverable event 
> as a lost
> OperatorEvent and triggers a failover, which the job's 
> NoRestartBackoffTimeStrategy
> turns into a hard failure. No data is lost (all rows are emitted) — this is a 
> benign
> test-config race, not a coordinator bug.
> Fix: defer END_OF_INPUT until notifyNoMoreSplits() has been received, so the 
> reader
> cannot finish before the event lands (mirroring 
> SourceReaderBase.finishedOrAvailableLater()).
> First observed in build 76431; the fix is justified on the structural race, 
> not frequency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to