[ https://issues.apache.org/jira/browse/FLINK-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130458#comment-17130458 ]
Robert Metzger commented on FLINK-18148: ---------------------------------------- Stop with savepoint has to complete within client.timeout = 60s (default). Time breakdown: checkpoint creation takes ~ 52 seconds {code} 2020-06-10 05:02:49,464 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 3 @ 1591765369463 for job 327846815d4c49ae30f9bdc7352218bc. 2020-06-10 05:03:41,100 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 3 for job 327846815d4c49ae30f9bdc7352218bc (401806 bytes in 51636 ms). {code} Stopping the job takes ~ 5 minutes {code} 2020-06-10 05:07:20,002 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job General purpose test job (327846815d4c49ae30f9bdc7352218bc) switched from state RUNNING to FINISHED. {code} The slow operators to stop are: {code} 2020-06-10 05:03:41,175 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: EventSource -> Timestamps/Watermarks (1/4) (396d333e1197d54dfc96352bc4aa5db4) switched from RUNNING to FINISHED. 2020-06-10 05:06:09,917 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - ArtificalKeyedStateMapper_Kryo_and_Custom_Stateful (3/4) (7dfc199016041880c09f9a43a2396c7e) switched from RUNNING to FINISHED. 2020-06-10 05:06:44,858 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - ArtificalKeyedStateMapper_Kryo_and_Custom_Stateful (2/4) (1d56a721f54ab6ae58c9782a2477d484) switched from RUNNING to FINISHED. 2020-06-10 05:07:09,917 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - ArtificalKeyedStateMapper_Kryo_and_Custom_Stateful (4/4) (9fa1b1913284b6a75b8021f9513ce932) switched from RUNNING to FINISHED. 2020-06-10 05:07:18,320 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - ArtificalKeyedStateMapper_Kryo_and_Custom_Stateful (1/4) (26f36930d6622198eb67711eaacae00c) switched from RUNNING to FINISHED. {code} >From the TM logs, it seems that the RocksDB cleanup is slow? {code} 2020-06-10 05:03:41,159 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Un-registering task and sending final execution state FINISHED to JobManager for task Source: EventSource -> Timestamps/Watermarks (1/4) 396d333e1197d54dfc96352bc4aa5db4. 2020-06-10 05:06:09,908 INFO org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend [] - Closed RocksDB State Backend. Cleaning up RocksDB working directory /tmp/flink-io-61f9aa77-daa2-4f83-9552-1f598583d156/job_327846815d4c49ae30f9bdc7352218bc_op_StreamMap_5271c210329e73bd743f3227edfb3b71__3_4__uuid_6e5a12d5-4906-4622-923e-70fb8eb9a23f. 2020-06-10 05:06:09,915 INFO org.apache.flink.runtime.taskmanager.Task [] - ArtificalKeyedStateMapper_Kryo_and_Custom_Stateful (3/4) (7dfc199016041880c09f9a43a2396c7e) switched from RUNNING to FINISHED. {code} > "Resuming Savepoint" e2e fails with TimeoutException in CliFrontend.stop() > --------------------------------------------------------------------------- > > Key: FLINK-18148 > URL: https://issues.apache.org/jira/browse/FLINK-18148 > Project: Flink > Issue Type: Bug > Components: Command Line Client > Affects Versions: 1.11.0, 1.12.0 > Reporter: Robert Metzger > Assignee: Robert Metzger > Priority: Critical > Labels: test-stability > Fix For: 1.11.0 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=2759&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5 > {code} > ------------------------------------------------------------ > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not stop with a savepoint job > "081bda854bc250e01055ed1ba9d43178". > at > org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:495) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:864) > at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:487) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:931) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992) > at > org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784) > at > java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928) > at > org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:493) > ... 6 more > Waiting for job (081bda854bc250e01055ed1ba9d43178) to reach terminal state > FINISHED ... > Job (081bda854bc250e01055ed1ba9d43178) reached terminal state FINISHED > Savepoint location was empty. This may mean that the stop-with-savepoint > failed. > [FAIL] Test script contains errors. > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)