[ 
https://issues.apache.org/jira/browse/FLINK-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130458#comment-17130458
 ] 

Robert Metzger commented on FLINK-18148:
----------------------------------------

Stop with savepoint has to complete within client.timeout = 60s (default).
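
As a rough illustration of why a slow shutdown shows up as a client-side TimeoutException: the client waits on a future for a bounded time and gives up, even though the work itself keeps running and may still finish later. The sketch below is a Python analogue, not Flink's code; `wait_for_stop` and the two stand-in actions are hypothetical names, and the timeouts are scaled down from seconds to fractions of a second.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def wait_for_stop(stop_action, timeout_s):
    """Run stop_action in the background and wait at most timeout_s for its result.

    Returns the result, or None if the wait timed out. The action itself is
    not cancelled on timeout; it keeps running in its worker thread.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(stop_action)
    try:
        return future.result(timeout=timeout_s)
    except TimeoutError:
        return None  # caller gave up waiting; no result (e.g. no savepoint path)
    finally:
        pool.shutdown(wait=False)  # do not block the caller on the slow action


def slow_stop():
    # Hypothetical stand-in for a stop that outlasts the client timeout.
    time.sleep(0.3)
    return "savepoint-path"


def fast_stop():
    # Hypothetical stand-in for a stop that finishes in time.
    return "savepoint-path"


if __name__ == "__main__":
    print(wait_for_stop(fast_stop, timeout_s=0.05))
    print(wait_for_stop(slow_stop, timeout_s=0.05))
```

In the timed-out case the caller gets no result while the background work still runs to completion, which would be consistent with a job reaching FINISHED while the reported savepoint location is empty.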

Time breakdown:
checkpoint creation takes ~ 52 seconds
{code}
2020-06-10 05:02:49,464 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering 
checkpoint 3 @ 1591765369463 for job 327846815d4c49ae30f9bdc7352218bc.
2020-06-10 05:03:41,100 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed 
checkpoint 3 for job 327846815d4c49ae30f9bdc7352218bc (401806 bytes in 51636 
ms).
{code}

Stopping the job then takes ~ 3.5 more minutes (the job reaches FINISHED ~ 4.5 minutes after the checkpoint was triggered)
{code}
2020-06-10 05:07:20,002 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job General 
purpose test job (327846815d4c49ae30f9bdc7352218bc) switched from state RUNNING 
to FINISHED.
{code}

The operators that are slow to stop are:
{code}
2020-06-10 05:03:41,175 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: 
EventSource -> Timestamps/Watermarks (1/4) (396d333e1197d54dfc96352bc4aa5db4) 
switched from RUNNING to FINISHED.
2020-06-10 05:06:09,917 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - 
ArtificalKeyedStateMapper_Kryo_and_Custom_Stateful (3/4) 
(7dfc199016041880c09f9a43a2396c7e) switched from RUNNING to FINISHED.
2020-06-10 05:06:44,858 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - 
ArtificalKeyedStateMapper_Kryo_and_Custom_Stateful (2/4) 
(1d56a721f54ab6ae58c9782a2477d484) switched from RUNNING to FINISHED.
2020-06-10 05:07:09,917 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - 
ArtificalKeyedStateMapper_Kryo_and_Custom_Stateful (4/4) 
(9fa1b1913284b6a75b8021f9513ce932) switched from RUNNING to FINISHED.
2020-06-10 05:07:18,320 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - 
ArtificalKeyedStateMapper_Kryo_and_Custom_Stateful (1/4) 
(26f36930d6622198eb67711eaacae00c) switched from RUNNING to FINISHED.
{code}
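
The timestamps in the log excerpts above can be turned into a rough time breakdown. This is a minimal sketch: the timestamps are copied from the quoted log lines, the 60 s value is the default client.timeout mentioned above, and all helper and variable names are made up for illustration.

```python
from datetime import datetime

LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"  # timestamp layout used in the quoted logs
CLIENT_TIMEOUT_S = 60  # default client.timeout as stated in the comment


def parse(ts: str) -> datetime:
    """Parse a log timestamp such as '2020-06-10 05:02:49,464'."""
    return datetime.strptime(ts, LOG_TS_FORMAT)


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds between two log timestamps."""
    return (parse(end) - parse(start)).total_seconds()


# Timestamps copied from the JobManager log lines quoted above.
CHECKPOINT_TRIGGERED = "2020-06-10 05:02:49,464"
CHECKPOINT_COMPLETED = "2020-06-10 05:03:41,100"
JOB_FINISHED = "2020-06-10 05:07:20,002"
MAPPER_SUBTASK_FINISHED = {
    "3/4": "2020-06-10 05:06:09,917",
    "2/4": "2020-06-10 05:06:44,858",
    "4/4": "2020-06-10 05:07:09,917",
    "1/4": "2020-06-10 05:07:18,320",
}

checkpoint_s = seconds_between(CHECKPOINT_TRIGGERED, CHECKPOINT_COMPLETED)
shutdown_s = seconds_between(CHECKPOINT_COMPLETED, JOB_FINISHED)
total_s = seconds_between(CHECKPOINT_TRIGGERED, JOB_FINISHED)

# How long each mapper subtask kept running after the checkpoint completed.
subtask_delay_s = {
    subtask: seconds_between(CHECKPOINT_COMPLETED, finished)
    for subtask, finished in MAPPER_SUBTASK_FINISHED.items()
}

if __name__ == "__main__":
    print(f"checkpoint: {checkpoint_s:.1f} s")
    print(f"shutdown after checkpoint: {shutdown_s:.1f} s")
    print(f"total: {total_s:.1f} s (client timeout: {CLIENT_TIMEOUT_S} s)")
    for subtask, delay in sorted(subtask_delay_s.items(), key=lambda kv: kv[1]):
        print(f"mapper {subtask} finished {delay:.1f} s after the checkpoint")
```

On these numbers the checkpoint alone uses most of the 60 s budget, and every mapper subtask finishes well after the budget has run out.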

From the TM logs, it seems that the RocksDB cleanup is slow?
{code}
2020-06-10 05:03:41,159 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - 
Un-registering task and sending final execution state FINISHED to JobManager 
for task Source: EventSource -> Timestamps/Watermarks (1/4) 
396d333e1197d54dfc96352bc4aa5db4.
2020-06-10 05:06:09,908 INFO  
org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend [] - Closed 
RocksDB State Backend. Cleaning up RocksDB working directory 
/tmp/flink-io-61f9aa77-daa2-4f83-9552-1f598583d156/job_327846815d4c49ae30f9bdc7352218bc_op_StreamMap_5271c210329e73bd743f3227edfb3b71__3_4__uuid_6e5a12d5-4906-4622-923e-70fb8eb9a23f.
2020-06-10 05:06:09,915 INFO  org.apache.flink.runtime.taskmanager.Task         
           [] - ArtificalKeyedStateMapper_Kryo_and_Custom_Stateful (3/4) 
(7dfc199016041880c09f9a43a2396c7e) switched from RUNNING to FINISHED.
{code}


> "Resuming Savepoint" e2e fails with TimeoutException in CliFrontend.stop() 
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-18148
>                 URL: https://issues.apache.org/jira/browse/FLINK-18148
>             Project: Flink
>          Issue Type: Bug
>          Components: Command Line Client
>    Affects Versions: 1.11.0, 1.12.0
>            Reporter: Robert Metzger
>            Assignee: Robert Metzger
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.11.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=2759&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5
> {code}
> ------------------------------------------------------------
>  The program finished with the following exception:
> org.apache.flink.util.FlinkException: Could not stop with a savepoint job 
> "081bda854bc250e01055ed1ba9d43178".
>       at 
> org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:495)
>       at 
> org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:864)
>       at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:487)
>       at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:931)
>       at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992)
>       at 
> org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
>       at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992)
> Caused by: java.util.concurrent.TimeoutException
>       at 
> java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
>       at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
>       at 
> org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:493)
>       ... 6 more
> Waiting for job (081bda854bc250e01055ed1ba9d43178) to reach terminal state 
> FINISHED ...
> Job (081bda854bc250e01055ed1ba9d43178) reached terminal state FINISHED
> Savepoint location was empty. This may mean that the stop-with-savepoint 
> failed.
> [FAIL] Test script contains errors.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
