Issue while restarting from SavePoint

Anirudh Mallem Tue, 18 Oct 2016 14:08:45 -0700

Hi,
I am relatively new to Flink and I was experimenting with the save points 
feature. I have an HA cluster running with 1 Master and 4 Workers. The 
flink-config.yaml is as follows :


#==============================================================================
# Common
#==============================================================================
jobmanager.rpc.address: stable-stream-master01.app.shared.int.sv2.247-inc.net

# The port where the JobManager's main actor system listens for messages.
jobmanager.rpc.port: 6123

# The heap size for the JobManager JVM
jobmanager.heap.mb: 512

# The heap size for the TaskManager JVM
taskmanager.heap.mb: 2048

# The number of task slots that each TaskManager offers. Each slot runs one 
parallel pipeline.
taskmanager.numberOfTaskSlots: 8

# Specify whether TaskManager memory should be allocated when starting up 
(true) or when
# memory is required in the memory manager (false)
taskmanager.memory.preallocate: false

# The parallelism used for programs that did not specify and other parallelism.
parallelism.default: 1

env.java.home: /usr/local/java

restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 2
restart-strategy.fixed-delay.delay: 10 s
#==============================================================================
# Web Frontend
#==============================================================================

# The port under which the web-based runtime monitor listens.
# A value of -1 deactivates the web server.

jobmanager.web.port: 8081

# Flag to specify whether job submission is enabled from the web-based
# runtime monitor. Uncomment to disable.

#jobmanager.web.submit.enable: false

#==============================================================================
# Streaming state checkpointing
#==============================================================================

# The backend that will be used to store operator state checkpoints if
# checkpointing is enabled.
#
# Supported backends: jobmanager, filesystem, <class-name-of-factory>
#
state.backend: filesystem


# Directory for storing checkpoints in a Flink-supported filesystem
# Note: State backend must be accessible from the JobManager and all 
TaskManagers.
# Use "hdfs://" for HDFS setups, "file://" for UNIX/POSIX-compliant file 
systems,
# (or any local file system under Windows), or "S3://" for S3 file system.
#
 state.backend.fs.checkpointdir: file:///home/amallem/
 state.savepoints.dir: file:///home/amallem/save/

#==============================================================================
# Master High Availability (required configuration)
#==============================================================================

# The list of ZooKepper quorum peers that coordinate the high-availability
# setup. This must be a list of the form:
# "host1:clientPort,host2[:clientPort],..." (default clientPort: 2181)
#
 recovery.mode: zookeeper
#
 recovery.zookeeper.quorum: 
stable-stream-zookeeper01.app.shared.int.net:2181,stable-stream-zookeeper02.app.shared.int.net:2181
#
# Note: You need to set the state backend to 'filesystem' and the checkpoint
# directory (see above) before configuring the storageDir.
#
 recovery.zookeeper.storageDir: file:///home/amallem/recovery
 recovery.zookeeper.path.root: /flink
 recovery.zookeeper.path.namespace: /cluster_one

Query : I am able to take a save point and then cancel my current job but when 
I try to start the job again using the save point I get the error stating 
“ProgramInvocationException: JobManager did not respond within 60000ms”. I 
checked Zookeeper and everything seems fine. I also followed the 
recommendations in the following post : 
http://stackoverflow.com/questions/36625742/cant-deploy-flow-to-ha-cluster-of-apache-flink-using-flink-cli
 Is there any configuration I am missing to enable restarting of the job. Any 
help will be appreciated. Thanks.

Regards,
Anirudh

Issue while restarting from SavePoint

Reply via email to