This is the first of two recovery problems I'm seeing running Flink 1.6.3 in
Kubernetes. I'm posting them in separate messages for brevity and because the
second is not directly related to the first. Any advice is appreciated.
Setup:
Flink 1.6.3 in Kubernetes (flink:1.6.3-hadoop28-scala_2.11). One JobManager
and two TaskManagers (TM_1, TM_2). Each pod has 4 CPUs. Each TaskManager has
16 task slots. High availability is enabled. S3 (s3a) for storage. RocksDB
with incremental snapshots. It doesn't matter if local recover is enabled -
I've managed to replicate with both local recovery enabled and disabled.
Problem:
Flink cannot recover a job unless there are the same number of free task slots
as the job's parallelism.
Replication steps:
Create a job with a parallelism of either 17 or 32 - enough to force the job to
use both TMs. After the job has successfully is fully running and has taken a
checkpoint, delete one of the TaskManagers (TM_1). Kubernetes will spawn a new
TaskManager (TM_3) which will successfully connect to the JobManager.
Actual Behavior:
The running job will be canceled and redeployed but will be caught in a
SCHEDULED state (shows as CREATED in the web UI). JobManager will repeatively
attempt to request slots from the ResourceManager. The tasks in the job will
never resume.
Expected Behavior:
Job should be fully unscheduled from TM_2. TM_2 and TM_3 should pick up the
job. The job should successfully resume from the last checkpoint.
Known Workarounds:
1) Cancel and resubmit the job.
2) Using the above example, have a free TaskManager (TM_4) that also has 16
available slots.
Log snip:
2019-01-10 19:42:50,299 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph -
Co-Process-Broadcast -> Flat Map -> Sink: Unnamed (29/32)
(6078b9c76953c7c27b05b522880d3d1b) switched from CANCELING to CANCELED.
2019-01-10 19:42:50,299 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph - Try to restart
or fail the job streamProcessorJob (c44a91b76ea99ead6fdf9b13a98c15bb) if no
longer possible.
2019-01-10 19:42:50,299 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph - Job
streamProcessorJob (c44a91b76ea99ead6fdf9b13a98c15bb) switched from state
FAILING to RESTARTING.
2019-01-10 19:42:50,299 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph - Restarting the
job streamProcessorJob (c44a91b76ea99ead6fdf9b13a98c15bb).
2019-01-10 19:42:50,302 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph - Job
streamProcessorJob (c44a91b76ea99ead6fdf9b13a98c15bb) switched from state
RESTARTING to CREATED.
2019-01-10 19:42:50,302 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore -
Recovering checkpoints from ZooKeeper.
2019-01-10 19:42:50,308 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found
1 checkpoints in ZooKeeper.
2019-01-10 19:42:50,308 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to fetch 1 checkpoints from storage.
2019-01-10 19:42:50,308 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 1.
2019-01-10 19:42:50,386 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Restoring job
c44a91b76ea99ead6fdf9b13a98c15bb from latest valid checkpoint: Checkpoint 1 @
1547149215694 for c44a91b76ea99ead6fdf9b13a98c15bb.
2019-01-10 19:42:50,388 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - No master state
to restore
2019-01-10 19:42:50,388 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph - Job
streamProcessorJob (c44a91b76ea99ead6fdf9b13a98c15bb) switched from state
CREATED to RUNNING.
2019-01-10 19:42:50,388 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph - Source:
Purchase Order Kafka Consumer (1/1) (49b728769a3a2b3a3a6ba45cd4445e3b) switched
from CREATED to SCHEDULED.
2019-01-10 19:42:50,388 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph - Source:
Purchase Order Bundle Kafka Consumer (1/1) (1220cf4b9f5eb937191bb2232a482899)
switched from CREATED to SCHEDULED.
2019-01-10 19:42:50,389 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph - Source:
ControlInstruction Kafka Consumer -> Filter -> Filter -> Map (1/1)
(29f69ee8fbc208cd7c63e99907d11386) switched from CREATED to SCHEDULED.
2019-01-10 19:42:50,389 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Ticket
Update Kafka Consumer (1/1) (4bdfbcb7280fb7a7c9ea2d5aa02efa41) switched from
CREATED to SCHEDULED.
2019-01-10 19:42:50,389 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Requesting new
slot [SlotRequestId{83bcd1c29b885a7799bf6e5d73d1961c}] and profile
ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0,
nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager.
2019-01-10 19:42:50,389 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph - Source:
Purchase Order Product Kafka Consumer (1/1) (9bfdbe9141c8b6715b890a35b026bb3b)
switched from CREATED to SCHEDULED.
...snip...
2019-01-10 19:42:50,471 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph -
Co-Process-Broadcast -> Flat Map -> Sink: Unnamed (32/32)
(cd851b29b5533f8a3d4812ac7e8d47ab) switched from CREATED to SCHEDULED.
2019-01-10 19:42:52,733 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager -
Registering TaskManager with ResourceID 6c84753ff746978c90a2e6d7627e6263
(akka.tcp://flink@flink-taskmanager-5748955d9-4wmj9:6126/user/taskmanager_0) at
ResourceManager
2019-01-10 19:42:52,736 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager -
Registering TaskManager with ResourceID 6c84753ff746978c90a2e6d7627e6263
(akka.tcp://flink@flink-taskmanager-5748955d9-4wmj9:6126/user/taskmanager_0) at
ResourceManager
2019-01-10 19:42:55,572 WARN akka.remote.ReliableDeliverySupervisor
- Association with remote system
[akka.tcp://flink@flink-taskmanager-5748955d9-vgtl4:6126] has failed, address
is now gated for [50] ms. Reason: [Association failed with
[akka.tcp://flink@flink-taskmanager-5748955d9-vgtl4:6126]] Caused by:
[flink-taskmanager-5748955d9-vgtl4: Name or service not known]
2019-01-10 19:44:20,388 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint
triggering task Source: ControlInstruction Kafka Consumer -> Filter -> Filter
-> Map (1/1) of job c44a91b76ea99ead6fdf9b13a98c15bb is not in state RUNNING
but SCHEDULED instead. Aborting checkpoint.
2019-01-10 19:45:50,388 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint
triggering task Source: ControlInstruction Kafka Consumer -> Filter -> Filter
-> Map (1/1) of job c44a91b76ea99ead6fdf9b13a98c15bb is not in state RUNNING
but SCHEDULED instead. Aborting checkpoint.
Many thanks,
John Stone