Hi Ivan,

From the JM log, the savepoint completed in about 1 second, while the timeout exception says that the stop-with-savepoint could not be completed within 60s (this is calculated as 20 (RestOptions#RETRY_MAX_ATTEMPTS) * 3s (RestOptions#RETRY_DELAY); you can check the logic here [1]). I'm not sure what the root cause is yet. Could you please share the complete JM log for the job? Thanks.
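To make the 60s figure concrete, here is a minimal sketch of the arithmetic only. The class and method names are hypothetical, and the two values are simply the ones quoted in this message; nothing here reads an actual Flink configuration or calls Flink code.

```java
import java.time.Duration;

// Hypothetical helper, for illustration only: it is not part of Flink.
public class RestRetryBudget {
    // Values quoted in this message (assumed, not read from a real configuration).
    static final int RETRY_MAX_ATTEMPTS = 20;
    static final Duration RETRY_DELAY = Duration.ofSeconds(3);

    // Total wait if every attempt is followed by one full delay.
    static Duration totalWait(int maxAttempts, Duration delay) {
        return delay.multipliedBy(maxAttempts);
    }

    public static void main(String[] args) {
        Duration budget = totalWait(RETRY_MAX_ATTEMPTS, RETRY_DELAY);
        System.out.println("Retry budget: " + budget.getSeconds() + "s");
    }
}
```

With these values the product is 60 seconds, which matches the roughly 60 seconds observed before the client reported the TimeoutException.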
[1] https://github.com/apache/flink/blob/abd58adb7aad54d242b67219498c211e9e18168b/flink-clients/src/main/java/org/apache/flink/client/program/rest/RestClusterClient.java#L382

Best,
Congxian

Ivan Yang <ivanygy...@gmail.com> wrote on Sat, Jul 25, 2020 at 3:48 AM:

> Hi Robert,
>
> Below is the job manager log after issuing the “flink stop” command:
>
> ====================
> 2020-07-24 19:24:12,388 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 1 (type=CHECKPOINT) @ 1595618652138 for job 853c59916ac33dfbf46503b33289929e.
> 2020-07-24 19:24:13,914 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 1 for job 853c59916ac33dfbf46503b33289929e (7146 bytes in 1774 ms).
> 2020-07-24 19:27:59,299 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Triggering stop-with-savepoint for job 853c59916ac33dfbf46503b33289929e.
> 2020-07-24 19:27:59,655 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 2 (type=SYNC_SAVEPOINT) @ 1595618879302 for job 853c59916ac33dfbf46503b33289929e.
> 2020-07-24 19:28:00,962 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 2 for job 853c59916ac33dfbf46503b33289929e (7147 bytes in 1240 ms).
> ======================
>
> It looks normal to me.
>
> In the Kubernetes deployment cluster, we set up a metrics reporter. It has these keys in the flink-config.yaml:
>
> # Metrics Reporter Configuration
> metrics.reporters: wavefront
> metrics.reporter.wavefront.interval: 60 SECONDS
> metrics.reporter.wavefront.env: prod
> metrics.reporter.wavefront.class: com.xxxxxxxxx.flink.monitor.WavefrontReporter
> metrics.reporter.wavefront.host: xxxxxx
> metrics.reporter.wavefront.token: xxxxxxxxxx
> metrics.scope.tm: flink.taskmanager
>
> Could this reporter interval interfere with the job manager?
> I tested the same job in a standalone Flink 1.11.0 without the reporter; “flink stop” worked, with no hanging or timeout. Also, the same reporter is used with version 1.9.1, where we didn't have any issue with “flink stop”.
>
> Thanks,
> Ivan
>
> On Jul 24, 2020, at 5:15 AM, Robert Metzger <rmetz...@apache.org> wrote:
>
> Hi Ivan,
> thanks a lot for your message. Can you post the JobManager log here as well? It might contain additional information on the reason for the timeout.
>
> On Fri, Jul 24, 2020 at 4:03 AM Ivan Yang <ivanygy...@gmail.com> wrote:
>
>> Hello everyone,
>>
>> We recently upgraded Flink from 1.9.1 to 1.11.0 and found one strange behavior: when we stop a job with a savepoint, we get the following timeout error. I checked the Flink web console, and the savepoint is created in S3 within 1 second. The job is fairly simple, so 1 second for savepoint generation is expected. We use a Kubernetes deployment. I clocked it: it takes about 60 seconds before it returns this error. Afterwards, the job hangs (it still says running, but is actually not doing anything), and I need to run another command to cancel it. Does anyone have an idea what's going on here? BTW, “flink stop” worked in 1.9.1 for us before.
>>
>> flink@flink-jobmanager-fcf5d84c5-sz4wk:~$ flink stop 88d9b46f59d131428e2a18c9c7b3aa3f
>> Suspending job "88d9b46f59d131428e2a18c9c7b3aa3f" with a savepoint.
>>
>> ------------------------------------------------------------
>> The program finished with the following exception:
>>
>> org.apache.flink.util.FlinkException: Could not stop with a savepoint job "88d9b46f59d131428e2a18c9c7b3aa3f".
>>     at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:495)
>>     at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:864)
>>     at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:487)
>>     at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:931)
>>     at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992)
>>     at java.security.AccessController.doPrivileged(Native Method)
>>     at javax.security.auth.Subject.doAs(Subject.java:422)
>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
>>     at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>     at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992)
>> Caused by: java.util.concurrent.TimeoutException
>>     at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
>>     at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
>>     at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:493)
>>     ... 9 more
>> flink@flink-jobmanager-fcf5d84c5-sz4wk:~$
>>
>> Thanks in advance,
>> Ivan