Hey guys,

We just built a brand new Flink 1.4.0 cluster with HA and everything seems
to be working fine, but we are getting some errors with savepoints.

For example, I have a running job

------------------ Running/Restarting Jobs -------------------
25.07.2018 11:55:18 : e5280bad25a7f19122f98483f94aba26 : Mr Banks (RUNNING)
--------------------------------------------------------------

If I try to create a savepoint with

flink savepoint e5280bad25a7f19122f98483f94aba26

The command hangs and never returns (I waited about 10 minutes with no
response). Then I tried to cancel the job with a savepoint:

flink cancel e5280bad25a7f19122f98483f94aba26 -s

And I got a

java.util.concurrent.TimeoutException: Futures timed out after [60000
milliseconds]

I checked the jobmanager logs, but I can't see any problems. I also checked
the Hadoop logs for errors (suspecting the problem might be in the underlying
system), but it seems it did create the nodes properly -- at least, there
are no errors there either.

Is there anything else I should check?

PS: My state is not that big (my napkin calculations say it's less than
1 GB), so it doesn't seem to be a problem with the state taking too long
to be saved.

-- 
*Julio Biason*, Software Engineer
*AZION*  |  Deliver. Accelerate. Protect.
Office: +55 51 3083 8101  |  Mobile: +55 51 99907 0554
