Recovery problem in Flink 1.6.3

John Stone Thu, 10 Jan 2019 12:46:52 -0800

This is the first of two recovery problems I'm seeing running Flink 1.6.3 in 
Kubernetes.  I'm posting them in separate messages for brevity and because the 
second is not directly related to the first.  Any advice is appreciated.


Setup:
Flink 1.6.3 in Kubernetes (flink:1.6.3-hadoop28-scala_2.11).  One JobManager 
and two TaskManagers (TM_1, TM_2).  Each pod has 4 CPUs.  Each TaskManager has 
16 task slots.  High availability is enabled.  S3 (s3a) for storage.  RocksDB 
with incremental snapshots.  It doesn't matter if local recover is enabled - 
I've managed to replicate with both local recovery enabled and disabled.

Problem:
Flink cannot recover a job unless there are the same number of free task slots 
as the job's parallelism.

Replication steps:
Create a job with a parallelism of either 17 or 32 - enough to force the job to 
use both TMs.  After the job has successfully is fully running and has taken a 
checkpoint, delete one of the TaskManagers (TM_1).  Kubernetes will spawn a new 
TaskManager (TM_3) which will successfully connect to the JobManager.

Actual Behavior:
The running job will be canceled and redeployed but will be caught in a 
SCHEDULED state (shows as CREATED in the web UI).  JobManager will repeatively 
attempt to request slots from the ResourceManager.  The tasks in the job will 
never resume.

Expected Behavior:
Job should be fully unscheduled from TM_2.  TM_2 and TM_3 should pick up the 
job.  The job should successfully resume from the last checkpoint.

Known Workarounds:
1) Cancel and resubmit the job.
2) Using the above example, have a free TaskManager (TM_4) that also has 16 
available slots.

Log snip:
2019-01-10 19:42:50,299 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph        - 
Co-Process-Broadcast -> Flat Map -> Sink: Unnamed (29/32) 
(6078b9c76953c7c27b05b522880d3d1b) switched from CANCELING to CANCELED.
2019-01-10 19:42:50,299 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Try to restart 
or fail the job streamProcessorJob (c44a91b76ea99ead6fdf9b13a98c15bb) if no 
longer possible.
2019-01-10 19:42:50,299 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job 
streamProcessorJob (c44a91b76ea99ead6fdf9b13a98c15bb) switched from state 
FAILING to RESTARTING.
2019-01-10 19:42:50,299 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Restarting the 
job streamProcessorJob (c44a91b76ea99ead6fdf9b13a98c15bb).
2019-01-10 19:42:50,302 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job 
streamProcessorJob (c44a91b76ea99ead6fdf9b13a98c15bb) switched from state 
RESTARTING to CREATED.
2019-01-10 19:42:50,302 INFO  
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - 
Recovering checkpoints from ZooKeeper.
2019-01-10 19:42:50,308 INFO  
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Found 
1 checkpoints in ZooKeeper.
2019-01-10 19:42:50,308 INFO  
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Trying 
to fetch 1 checkpoints from storage.
2019-01-10 19:42:50,308 INFO  
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Trying 
to retrieve checkpoint 1.
2019-01-10 19:42:50,386 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Restoring job 
c44a91b76ea99ead6fdf9b13a98c15bb from latest valid checkpoint: Checkpoint 1 @ 
1547149215694 for c44a91b76ea99ead6fdf9b13a98c15bb.
2019-01-10 19:42:50,388 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - No master state 
to restore
2019-01-10 19:42:50,388 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job 
streamProcessorJob (c44a91b76ea99ead6fdf9b13a98c15bb) switched from state 
CREATED to RUNNING.
2019-01-10 19:42:50,388 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: 
Purchase Order Kafka Consumer (1/1) (49b728769a3a2b3a3a6ba45cd4445e3b) switched 
from CREATED to SCHEDULED.
2019-01-10 19:42:50,388 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: 
Purchase Order Bundle Kafka Consumer (1/1) (1220cf4b9f5eb937191bb2232a482899) 
switched from CREATED to SCHEDULED.
2019-01-10 19:42:50,389 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: 
ControlInstruction Kafka Consumer -> Filter -> Filter -> Map (1/1) 
(29f69ee8fbc208cd7c63e99907d11386) switched from CREATED to SCHEDULED.
2019-01-10 19:42:50,389 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: Ticket 
Update Kafka Consumer (1/1) (4bdfbcb7280fb7a7c9ea2d5aa02efa41) switched from 
CREATED to SCHEDULED.
2019-01-10 19:42:50,389 INFO  
org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Requesting new 
slot [SlotRequestId{83bcd1c29b885a7799bf6e5d73d1961c}] and profile 
ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, 
nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager.
2019-01-10 19:42:50,389 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: 
Purchase Order Product Kafka Consumer (1/1) (9bfdbe9141c8b6715b890a35b026bb3b) 
switched from CREATED to SCHEDULED.

...snip...

2019-01-10 19:42:50,471 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph        - 
Co-Process-Broadcast -> Flat Map -> Sink: Unnamed (32/32) 
(cd851b29b5533f8a3d4812ac7e8d47ab) switched from CREATED to SCHEDULED.
2019-01-10 19:42:52,733 INFO  
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - 
Registering TaskManager with ResourceID 6c84753ff746978c90a2e6d7627e6263 
(akka.tcp://flink@flink-taskmanager-5748955d9-4wmj9:6126/user/taskmanager_0) at 
ResourceManager
2019-01-10 19:42:52,736 INFO  
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - 
Registering TaskManager with ResourceID 6c84753ff746978c90a2e6d7627e6263 
(akka.tcp://flink@flink-taskmanager-5748955d9-4wmj9:6126/user/taskmanager_0) at 
ResourceManager
2019-01-10 19:42:55,572 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system 
[akka.tcp://flink@flink-taskmanager-5748955d9-vgtl4:6126] has failed, address 
is now gated for [50] ms. Reason: [Association failed with 
[akka.tcp://flink@flink-taskmanager-5748955d9-vgtl4:6126]] Caused by: 
[flink-taskmanager-5748955d9-vgtl4: Name or service not known]
2019-01-10 19:44:20,388 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Checkpoint 
triggering task Source: ControlInstruction Kafka Consumer -> Filter -> Filter 
-> Map (1/1) of job c44a91b76ea99ead6fdf9b13a98c15bb is not in state RUNNING 
but SCHEDULED instead. Aborting checkpoint.
2019-01-10 19:45:50,388 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Checkpoint 
triggering task Source: ControlInstruction Kafka Consumer -> Filter -> Filter 
-> Map (1/1) of job c44a91b76ea99ead6fdf9b13a98c15bb is not in state RUNNING 
but SCHEDULED instead. Aborting checkpoint.

Many thanks,

John Stone

Recovery problem in Flink 1.6.3

Reply via email to