Re: Flink 1.11 job stop with save point timeout error

Congxian Qiu Sat, 25 Jul 2020 05:15:23 -0700

Hi Ivan
   From the JM log, the savepoint complete with 1 second, and the timeout
exception said that the stop-with-savepoint can not be completed in
60s(this was calculated by 20 -- RestOptions#RETRAY_MAX_ATTEMPTS * 3s --
RestOptions#RETRY_DELAY. you can check the logic here[1]). I'm not sure
what the root cause is currently, could you please share the complete job
JM log. thanks.


[1]
https://github.com/apache/flink/blob/abd58adb7aad54d242b67219498c211e9e18168b/flink-clients/src/main/java/org/apache/flink/client/program/rest/RestClusterClient.java#L382
Best,
Congxian


Ivan Yang <ivanygy...@gmail.com> 于2020年7月25日周六 上午3:48写道：

> Hi Robert,
> Below is the job manager log after issuing the “flink stop” command
>
> ====================
> 2020-07-24 19:24:12,388 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] -
> Triggering checkpoint 1 (type=CHECKPOINT) @ 1595618652138 for job
> 853c59916ac33dfbf46503b33289929e.
> 2020-07-24 19:24:13,914 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed
> checkpoint 1 for job 853c59916ac33dfbf46503b33289929e (7146 bytes in 1774
> ms).
> 2020-07-24 19:27:59,299 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Triggering stop-with-savepoint for job
> 853c59916ac33dfbf46503b33289929e.
> 2020-07-24 19:27:59,655 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] -
> Triggering checkpoint 2 (type=SYNC_SAVEPOINT) @ 1595618879302 for job
> 853c59916ac33dfbf46503b33289929e.
> 2020-07-24 19:28:00,962 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed
> checkpoint 2 for job 853c59916ac33dfbf46503b33289929e (7147 bytes in 1240
> ms).
> ======================
>
> It looks normal to me.
>
> In the kubernetes deployment cluster, we set up a metric reporter, it has
> these keys in the flink-config.yaml
>
> # Metrics Reporter Configuration
> metrics.reporters: wavefront
> metrics.reporter.wavefront.interval: 60 SECONDS
> metrics.reporter.wavefront.env: prod
> metrics.reporter.wavefront.class:
> com.xxxxxxxxx.flink.monitor.WavefrontReporter
> metrics.reporter.wavefront.host: xxxxxx
> metrics.reporter.wavefront.token: xxxxxxxxxx
> metrics.scope.tm: flink.taskmanager
>
> Could this reporter interval interfere the job manager? I test the same
> job in a standalone
> Flink 1.11.0 without the reporter, Flink stop worked, and no hanging nor
> timeout. Also the same reporter is used in 1.9.1 version where we didn’t
> have issue on “flink stop”.
>
> Thanks
> Ivan
>
>
> On Jul 24, 2020, at 5:15 AM, Robert Metzger <rmetz...@apache.org> wrote:
>
> Hi Ivan,
> thanks a lot for your message. Can you post the JobManager log here as
> well? It might contain additional information on the reason for the timeout.
>
> On Fri, Jul 24, 2020 at 4:03 AM Ivan Yang <ivanygy...@gmail.com> wrote:
>
>> Hello everyone,
>>
>> We recently upgrade FLINK from 1.9.1 to 1.11.0. Found one strange
>> behavior when we stop a job to a save point got following time out error.
>> I checked Flink web console, the save point is created in s3 in 1
>> second.The job is fairly simple, so 1 second for savepoint generation is
>> expected. We use kubernetes deployment. I clocked it, it’s about 60 seconds
>> when it returns this error. So afterwards, the job is hanging (it still
>> says running, but actually not doing anything). I need run another command
>> to cancel it. Anyone has idea what’s going on here? BTW, “flink stop works”
>> in 1.19.1 for us before
>>
>>
>>
>> flink@flink-jobmanager-fcf5d84c5-sz4wk:~$ flink stop
>> 88d9b46f59d131428e2a18c9c7b3aa3f
>> Suspending job "88d9b46f59d131428e2a18c9c7b3aa3f" with a savepoint.
>>
>> ------------------------------------------------------------
>>  The program finished with the following exception:
>>
>> org.apache.flink.util.FlinkException: Could not stop with a savepoint job
>> "88d9b46f59d131428e2a18c9c7b3aa3f".
>> at
>> org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:495)
>> at
>> org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:864)
>> at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:487)
>> at
>> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:931)
>> at
>> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:422)
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
>> at
>> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>> at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992)
>> Caused by: java.util.concurrent.TimeoutException
>> at
>> java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
>> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
>> at
>> org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:493)
>> ... 9 more
>> flink@flink-jobmanager-fcf5d84c5-sz4wk:~$
>>
>>
>> Thanks in advance,
>> Ivan
>>
>
>

Re: Flink 1.11 job stop with save point timeout error

Reply via email to