[ https://issues.apache.org/jira/browse/FLINK-22568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17402295#comment-17402295 ]
Till Rohrmann commented on FLINK-22568:
---------------------------------------

I think the problem is our CI infrastructure, because there is a gap of about 13s between receiving the savepoint command (12:59:25,726) and actually triggering the savepoint (12:59:39,015):

{code}
12:59:25,726 [flink-akka.actor.default-dispatcher-5] INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Triggering savepoint for job 2e88260d7fd42515ac6b8181788b2583.
12:59:27,579 [jobmanager-future-thread-6] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 1 for job 2e88260d7fd42515ac6b8181788b2583 (1362646 bytes, checkpointDuration=1623 ms, finalizationTime=291 ms).
12:59:39,015 [ Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 2 (type=SAVEPOINT) @ 1628773179014 for job 2e88260d7fd42515ac6b8181788b2583.
12:59:39,019 [ main] ERROR org.apache.flink.test.checkpointing.RescalingITCase [] -
--------------------------------------------------------------------------------
Test testSavepointRescalingWithKeyedAndNonPartitionedState[backend = filesystem, buffersPerChannel = 0](org.apache.flink.test.checkpointing.RescalingITCase) failed with:
java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Invocation of [LocalRpcInvocation(RestfulGateway.triggerSavepoint(JobID, String, boolean, Time))] at recipient [akka://flink/user/rpc/dispatcher_200] timed out.
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
	at org.apache.flink.test.checkpointing.RescalingITCase.testSavepointRescalingWithKeyedAndNonPartitionedState(RescalingITCase.java:425)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
	at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
	at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
	at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
	at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
	at org.junit.runners.Suite.runChild(Suite.java:128)
	at org.junit.runners.Suite.runChild(Suite.java:27)
	at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
	at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
	at org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:43)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
	at java.util.Iterator.forEachRemaining(Iterator.java:116)
	at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
	at org.junit.vintage.engine.VintageTestEngine.executeAllChildren(VintageTestEngine.java:82)
	at org.junit.vintage.engine.VintageTestEngine.execute(VintageTestEngine.java:73)
	at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:220)
	at org.junit.platform.launcher.core.DefaultLauncher.lambda$execute$6(DefaultLauncher.java:188)
	at org.junit.platform.launcher.core.DefaultLauncher.withInterceptedStreams(DefaultLauncher.java:202)
	at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:181)
	at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:128)
	at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invokeAllTests(JUnitPlatformProvider.java:150)
	at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invoke(JUnitPlatformProvider.java:120)
	at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
	at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
Caused by: java.util.concurrent.TimeoutException: Invocation of [LocalRpcInvocation(RestfulGateway.triggerSavepoint(JobID, String, boolean, Time))] at recipient [akka://flink/user/rpc/dispatcher_200] timed out.
	at com.sun.proxy.$Proxy369.triggerSavepoint(Unknown Source)
	at org.apache.flink.runtime.minicluster.MiniCluster.lambda$triggerSavepoint$9(MiniCluster.java:741)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
	at java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:628)
	at java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1996)
	at org.apache.flink.runtime.minicluster.MiniCluster.runDispatcherCommand(MiniCluster.java:776)
	at org.apache.flink.runtime.minicluster.MiniCluster.triggerSavepoint(MiniCluster.java:739)
	at org.apache.flink.client.program.MiniClusterClient.triggerSavepoint(MiniClusterClient.java:101)
	at org.apache.flink.test.checkpointing.RescalingITCase.testSavepointRescalingWithKeyedAndNonPartitionedState(RescalingITCase.java:422)
	... 60 more
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/rpc/dispatcher_200#1435682858]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
{code}

I hope this problem will be resolved, or at least mitigated, by FLINK-22932.

> RescalingITCase.testSavepointRescalingInPartitionedOperatorStateList fails with Timeout
> ---------------------------------------------------------------------------------------
>
>                 Key: FLINK-22568
>                 URL: https://issues.apache.org/jira/browse/FLINK-22568
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.14.0
>            Reporter: Matthias
>            Priority: Major
>              Labels: test-stability
>             Fix For: 1.14.0
>
>
> [This build|https://dev.azure.com/mapohl/flink/_build/results?buildId=409&view=logs&j=0a15d512-44ac-5ba5-97ab-13a5d066c22c&t=634cd701-c189-5dff-24cb-606ed884db87] failed (not exclusively) due to:
> * [testSavepointRescalingInPartitionedOperatorStateList[backend = filesystem](org.apache.flink.test.checkpointing.RescalingITCase)|https://dev.azure.com/mapohl/flink/_build/results?buildId=409&view=logs&j=0a15d512-44ac-5ba5-97ab-13a5d066c22c&t=634cd701-c189-5dff-24cb-606ed884db87&l=4193]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
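
As a rough cross-check of the timing argument in the comment, the delay between the "Triggering savepoint" line from the JobMaster and the "Triggering checkpoint 2 (type=SAVEPOINT)" line from the CheckpointCoordinator can be computed from the two log timestamps and compared with the 10000 ms ask timeout reported in the AskTimeoutException. The sketch below is only illustrative: the class name `SavepointGap` and the helper `gapMillis` are hypothetical and not part of Flink, and it assumes both timestamps fall on the same day.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

public class SavepointGap {
    // Time-of-day format used in the log excerpt: a comma separates seconds and milliseconds.
    private static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("HH:mm:ss,SSS");

    /** Returns the elapsed milliseconds between two same-day log timestamps. */
    public static long gapMillis(String earlier, String later) {
        LocalTime start = LocalTime.parse(earlier, LOG_TIME);
        LocalTime end = LocalTime.parse(later, LOG_TIME);
        return Duration.between(start, end).toMillis();
    }

    public static void main(String[] args) {
        // Ask timeout taken from the AskTimeoutException message ("after [10000 ms]").
        long askTimeoutMillis = 10_000L;

        // Timestamps copied from the two log lines quoted in the comment.
        long gap = gapMillis("12:59:25,726", "12:59:39,015");

        System.out.println("gap = " + gap + " ms, exceeds ask timeout: "
                + (gap > askTimeoutMillis));
    }
}
```

With the timestamps from the log, the gap comes out at 13289 ms, which is longer than the 10000 ms ask timeout. That is consistent with the RPC call timing out before the savepoint was actually triggered.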