Re: Could not cancel job (with savepoint) "Ask timed out"

Juho Autio Thu, 09 Aug 2018 00:08:08 -0700

Thanks for the suggestion. Is triggering the savepoint separately an
asynchronous operation? Would you then poll separately for the savepoint's
completion before executing cancel? If additional polling is needed, then
for my purposes it's still easier to call cancel with savepoint and simply
ignore the result of the call. I would assume that it does no harm to keep
retrying cancel with savepoint until the job stops; I expect that an
overlapping cancel request is ignored if the job is already creating a
savepoint. Please correct me if my assumption is wrong.
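
A minimal sketch of the retry approach described above, in Python. It
assumes that repeating cancel with savepoint is harmless (the unconfirmed
assumption stated above) and that `flink list -r` prints the IDs of running
jobs so that a plain substring check is enough. The function names, the
retry count, and the wait time are hypothetical choices for illustration,
not part of Flink.

```python
import subprocess
import time
from typing import Callable, List, Tuple

# A runner takes an argv list and returns (exit code, combined output).
Runner = Callable[[List[str]], Tuple[int, str]]


def run_cli(cmd: List[str]) -> Tuple[int, str]:
    """Run a command and return its exit code and combined stdout/stderr."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout


def cancel_cmd(flink: str, job_id: str, app_id: str) -> List[str]:
    """Same invocation as the one quoted further down in this thread."""
    return [flink, "cancel", job_id, "-m", "yarn-cluster",
            "-yid", app_id, "--withSavepoint"]


def list_cmd(flink: str, app_id: str) -> List[str]:
    """List running jobs of the given YARN application."""
    return [flink, "list", "-r", "-m", "yarn-cluster", "-yid", app_id]


def cancel_until_stopped(
    job_id: str,
    app_id: str,
    flink: str = "flink-1.5.2/bin/flink",
    run: Runner = run_cli,
    attempts: int = 5,
    wait_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry cancel with savepoint until the job no longer shows as running.

    Returns True once the job ID is absent from a successful `list -r`
    call, and False if it is still listed (or the listing keeps failing)
    after `attempts` tries.
    """
    for _ in range(attempts):
        # The result is ignored on purpose: the call may report
        # "Ask timed out" even though the cancellation is in progress.
        run(cancel_cmd(flink, job_id, app_id))
        code, out = run(list_cmd(flink, app_id))
        # Assumption: a failed listing tells us nothing, so only a
        # successful listing without the job ID counts as "stopped".
        if code == 0 and job_id not in out:
            return True
        sleep(wait_s)
    return False
```

It could be called as `cancel_until_stopped("b7c7d19d25e16a952d3afa32841024e5",
"application_1533676784032_0001")`. If overlapping cancel requests are not
ignored, each retry may start another savepoint, so the two-step approach
with explicit polling would be the safer choice.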


On Thu, Aug 9, 2018 at 5:04 AM vino yang <yanghua1...@gmail.com> wrote:

> Hi Juho,
>
> This problem does exist. As a temporary workaround, I suggest you
> separate the two steps:
> 1) Trigger the savepoint separately;
> 2) Execute the cancel command.
>
> Hi Till, Chesnay:
>
> We have encountered similar problems in our internal environment, and so
> have multiple users on the mailing list.
>
> In our environment, it seems that the JM reports the savepoint as
> complete and has stopped itself, but the client still tries to connect to
> the old JM and reports a timeout exception.
>
> Thanks, vino.
>
>
> Juho Autio <juho.au...@rovio.com> wrote on Wed, Aug 8, 2018 at 9:18 PM:
>
>> I was trying to cancel a job with savepoint, but the CLI command failed
>> with "akka.pattern.AskTimeoutException: Ask timed out".
>>
>> The stack trace reveals that ask timeout is 10 seconds:
>>
>> Caused by: akka.pattern.AskTimeoutException: Ask timed out on
>> [Actor[akka://flink/user/jobmanager_0#106635280]] after [10000 ms].
>> Sender[null] sent message of type
>> "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
>>
>> Indeed, the documented default value for akka.ask.timeout is "10 s":
>>
>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#distributed-coordination-via-akka
>>
>> Behind the scenes, the savepoint creation and job cancellation
>> succeeded, which was more or less to be expected. So my problem is just
>> getting a proper response back from the CLI call instead of having it
>> time out so eagerly.
>>
>> To be exact, what I ran was:
>>
>> flink-1.5.2/bin/flink cancel b7c7d19d25e16a952d3afa32841024e5 -m
>> yarn-cluster -yid application_1533676784032_0001 --withSavepoint
>>
>> Should I increase akka.ask.timeout? If yes, can I somehow override it
>> just for the CLI call? Could it have undesired side effects if it is set
>> globally and the actual Flink jobs use it too?
>>
>> What about akka.client.timeout? Its default is also rather low: "60 s".
>> Should it be increased accordingly if I want to allow savepoint creation
>> to take longer than 60 s?
>>
>> Finally, that default timeout is so low that I would expect this to be a
>> common problem. I would say that the Flink CLI should have a higher
>> default timeout for cancel and savepoint creation operations.
>>
>> Thanks!
>>
>
