[ https://issues.apache.org/jira/browse/FLINK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118281#comment-17118281 ]
Xintong Song edited comment on FLINK-15687 at 5/28/20, 5:54 AM:
----------------------------------------------------------------

I've printed the thread names for all test-scope accesses to {{TaskSlotTable}}. The reported problem shows up in the following cases. I'll try to provide fixes asap.
* -TaskExecutorTest-
** -testDynamicSlotAllocation-
** -testOfferSlotToJobMasterAfterTimeout-
** -testShouldShutDownTaskManagerServicesInPostStop-
* TaskExecutorSubmissionTest
** testFailingScheduleOrUpdateConsumers
** testUpdateTaskInputPartitionsFailure
** testRunJobWithForwardChannel
** testTaskSubmissionAndCancelling
** testCancellingDependentAndStateUpdateFails
** testLocalPartitionNotFound
** testRemotePartitionNotFound
** testTaskSubmission
** testGateChannelEdgeMismatch
** testRequestTaskBackPressure
* TaskExecutorOperatorEventHandlingTest
** eventHandlingInTaskFailureFailsTask
** eventToCoordinatorDeliveryFailureFailsTask

*EDIT:* Looking more into these cases, I think the three cases of {{TaskExecutorTest}} should be fine. Accesses to {{TaskSlotTable}} from the testing thread are guaranteed (via {{CompletableFuture#get}} and {{OneShotLatch#await}}) to take place after the main thread activities have finished.


was (Author: xintongsong):
I've printed the thread names for all test-scope accesses to {{TaskSlotTable}}. The reported problem shows up in the following cases. I'll try to provide fixes asap.
* TaskExecutorTest
** testDynamicSlotAllocation
** testOfferSlotToJobMasterAfterTimeout
** testShouldShutDownTaskManagerServicesInPostStop
* TaskExecutorSubmissionTest
** testFailingScheduleOrUpdateConsumers
** testUpdateTaskInputPartitionsFailure
** testRunJobWithForwardChannel
** testTaskSubmissionAndCancelling
** testCancellingDependentAndStateUpdateFails
** testLocalPartitionNotFound
** testRemotePartitionNotFound
** testTaskSubmission
** testGateChannelEdgeMismatch
** testRequestTaskBackPressure
* TaskExecutorOperatorEventHandlingTest
** eventHandlingInTaskFailureFailsTask
** eventToCoordinatorDeliveryFailureFailsTask

> Potential test instabilities due to concurrent access to TaskSlotTable.
> -----------------------------------------------------------------------
>
>                 Key: FLINK-15687
>                 URL: https://issues.apache.org/jira/browse/FLINK-15687
>             Project: Flink
>          Issue Type: Task
>          Components: Runtime / Coordination, Tests
>    Affects Versions: 1.10.0
>            Reporter: Kostas Kloudas
>            Assignee: Xintong Song
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.11.0
>
>
> Working on [FLINK-14742|https://issues.apache.org/jira/browse/FLINK-14742]
> revealed that the cause of that test instability was the modification of
> the {{taskSlotTable}} of the {{TaskManager}} under test from multiple
> threads, namely the test thread and the main thread of the {{rpcEndpoint}}.
> This data structure is not thread-safe, so this should not happen.
> This anti-pattern seems to be repeated in multiple tests, such as most of the
> tests in {{TaskExecutorSubmissionTest}} (look for the calls to
> {{TaskSlotTable.allocateSlot()}}). There we seem to call
> {{taskSlotTable.allocateSlot()}} and then {{tmGateway.submitTask()}}, which
> essentially accesses the slot table from within the main rpc-endpoint
> thread.
> This JIRA is just to investigate whether this is also a problem in those
> tests or not.
> cc [~trohrmann], [~chesnay], [~yangwang166]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
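
For illustration only, the anti-pattern described in the issue (a test thread touching a non-thread-safe table that is also used by the endpoint's main thread) and one way to avoid it can be sketched in plain Java. This is a minimal sketch under stated assumptions, not the fix applied in Flink: {{SlotTable}}, {{MainThreadConfinementSketch}} and both helper methods are hypothetical names, a single-thread executor stands in for the rpc-endpoint main thread, and {{CompletableFuture#join}} is used as the unchecked counterpart of {{CompletableFuture#get}}.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class MainThreadConfinementSketch {

    /** Hypothetical stand-in for a non-thread-safe slot table (not Flink's TaskSlotTable). */
    static final class SlotTable {
        private final Map<Integer, String> slots = new HashMap<>();
        private final Thread owner;

        SlotTable(Thread owner) {
            this.owner = owner;
        }

        boolean allocateSlot(int index, String jobId) {
            assertOwnerThread();
            return slots.putIfAbsent(index, jobId) == null;
        }

        int allocatedSlots() {
            assertOwnerThread();
            return slots.size();
        }

        // Rejects any access that does not come from the owning ("main") thread.
        private void assertOwnerThread() {
            if (Thread.currentThread() != owner) {
                throw new IllegalStateException(
                        "SlotTable accessed from " + Thread.currentThread().getName());
            }
        }
    }

    /** Routes every table access through the single "main" thread and waits for the result. */
    static int allocateViaMainThread() {
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        try {
            // Create the table on the main thread so that thread becomes its owner.
            SlotTable table = CompletableFuture
                    .supplyAsync(() -> new SlotTable(Thread.currentThread()), mainThread)
                    .join();
            // The test thread only submits work; join() ensures the main-thread
            // activity has finished before the test thread continues.
            CompletableFuture
                    .supplyAsync(() -> table.allocateSlot(0, "job-1"), mainThread)
                    .join();
            return CompletableFuture.supplyAsync(table::allocatedSlots, mainThread).join();
        } finally {
            mainThread.shutdown();
        }
    }

    /** Shows the anti-pattern: the calling (test) thread touches the table directly. */
    static boolean directAccessRejected() {
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        try {
            SlotTable table = CompletableFuture
                    .supplyAsync(() -> new SlotTable(Thread.currentThread()), mainThread)
                    .join();
            try {
                table.allocateSlot(0, "job-1"); // called from the wrong thread
                return false;
            } catch (IllegalStateException expected) {
                return true;
            }
        } finally {
            mainThread.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("allocated slots: " + allocateViaMainThread());
        System.out.println("direct access rejected: " + directAccessRejected());
    }
}
```

The owner-thread check is only a debugging aid in this sketch; the point is that the test thread never reads or mutates the table itself and only observes results after the submitted main-thread work has completed.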