[ https://issues.apache.org/jira/browse/FLINK-22488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated FLINK-22488:
-----------------------------------
    Labels: pull-request-available  (was: )

> KafkaSourceLegacyITCase.testOneToOneSources failed due to "OperatorEvent from an OperatorCoordinator to a task was lost"
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-22488
>                 URL: https://issues.apache.org/jira/browse/FLINK-22488
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Dong Lin
>            Priority: Minor
>              Labels: pull-request-available
>
> According to [1], the test KafkaSourceLegacyITCase.testOneToOneSources failed
> because it runs a streaming job (which uses KafkaSource) with
> restartAttempts=1. Besides the failover explicitly triggered by the
> FailingIdentityMapper, the job failed a second time due to
> "org.apache.flink.util.FlinkException: An OperatorEvent from an
> OperatorCoordinator to a task was lost. Triggering task failover to ensure
> consistency", which the test does not expect. With only one restart attempt
> allowed, this second failure makes the job, and therefore the test, fail.
>
> Note that SubtaskGatewayImpl was updated by [2] on 4/14 and now triggers a
> task failover if any OperatorEvent is lost. This could explain why those Kafka
> tests started to fail with the exception described above.
>
> To make this test stable, let's try to understand why there is such a high
> chance of losing an OperatorEvent in the Azure test pipeline. If we cannot
> avoid losing OperatorEvents in the test pipeline, we probably need to update
> the test to allow the pipeline to be restarted an arbitrary number of times
> (and still be able to stop the test on the happy path).
>
> [1] https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17212&view=logs&j=c5f0071e-1851-543e-9a45-9ac140befc32&t=1fb1a56f-e8b5-5a82-00a0-a2db7757b4f5&l=6960
> [2] https://github.com/apache/flink/pull/15605

--
This message was sent by Atlassian Jira
(v8.3.4#803005)