Martijn Visser created FLINK-40011:
--------------------------------------

             Summary: 
CommonExecSinkITCase.testStreamRecordTimestampInserterSinkRuntimeProvider is 
flaky: test source can finish before NoMoreSplitsEvent is delivered
                 Key: FLINK-40011
                 URL: https://issues.apache.org/jira/browse/FLINK-40011
             Project: Flink
          Issue Type: Bug
          Components: Table SQL / Planner, Tests
    Affects Versions: 2.4.0
            Reporter: Martijn Visser
            Assignee: Martijn Visser


testStreamRecordTimestampInserterSinkRuntimeProvider fails intermittently when 
the
job fails instead of finishing.

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=76431&view=results
 (leg: test_ci table)

{code}
org.apache.flink.table.api.TableException: Failed to wait job finish
  at 
CommonExecSinkITCase.testStreamRecordTimestampInserterSinkRuntimeProvider(CommonExecSinkITCase.java:127)
Caused by: JobExecutionException: Job execution failed.
Caused by: org.apache.flink.runtime.JobException: Recovery is suppressed by 
NoRestartBackoffTimeStrategy
Caused by: org.apache.flink.util.FlinkException: An OperatorEvent from an 
OperatorCoordinator to a task was lost. Triggering task failover to ensure 
consistency. Event: '[NoMoreSplitEvent]', targetTask: Source: T1[154] -> 
WatermarkAssigner[155] -> StreamRecordTimestampInserter[156] -> T1[156]: Writer 
(1/4) - execution #0
Caused by: 
org.apache.flink.runtime.operators.coordination.TaskNotRunningException: Task 
is not running, but in state FINISHED
{code}

Root cause: the in-test TestSource reader returned END_OF_INPUT as soon as it 
had
emitted its rows. The SingleSplitEnumerator assigns the split and then delivers 
a
NoMoreSplitsEvent as a separate event via the coordinator thread. If the 
split-holding
subtask (here, (1/4)) finishes before that event is delivered, the coordinator 
sends
it to an already-FINISHED task; the framework treats the undeliverable event as 
a lost
OperatorEvent and triggers a failover, which the job's 
NoRestartBackoffTimeStrategy
turns into a hard failure. No data is lost (all rows are emitted) — this is a 
benign
test-config race, not a coordinator bug.

Fix: defer END_OF_INPUT until notifyNoMoreSplits() has been received, so the 
reader
cannot finish before the event lands (mirroring 
SourceReaderBase.finishedOrAvailableLater()).
First observed in build 76431; the fix is justified on the structural race, not 
frequency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to