Dear Flink experts,

We recently ran into an issue during job cancellation after upgrading to 1.13.
After we issue a cancel (from the Flink console or flink cancel {jobid}), a few
subtasks get stuck in the CANCELLING state. Once the job gets into that situation,
the behavior is consistent: those CANCELLING tasks never become CANCELED.
After 3 minutes the job stops and, as a result, a number of TaskManagers are
lost. It then takes about another 5 minutes for those lost TaskManagers to
rejoin the JobManager, after which we can restart the job from the previous
checkpoint. We found the following stack trace on the hanging (cancelling) TaskManager.
==========================================
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:705)
org.apache.flink.streaming.runtime.tasks.SourceStreamTask.cleanUpInvoke(SourceStreamTask.java:186)
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:637)
org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:776)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
java.lang.Thread.run(Thread.java:748)
===========================================

Here is some background information about our job and setup.
1) The job is relatively large: we run with 500+ parallelism and 2000+ subtasks.
It mainly reads from a Kinesis stream, performs some transformations, and fans
out to multiple output S3 buckets. It is a stateless ETL job.
2) The same code and setup running in smaller environments do not seem to hit
this cancel-failure problem.
3) We had been running the same job on Flink 1.11.0 and never saw this cancel
failure or the resulting loss of TaskManagers.
4) As part of the upgrade to 1.13, we also added Kubernetes HA (ZooKeeper-less);
previously we did not use HA. A rough sketch of the relevant HA settings follows below.
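
For context, this is roughly what our ZooKeeper-less HA configuration looks like
in flink-conf.yaml; the cluster id and storage path below are placeholders, not
our actual values:
==========================================
kubernetes.cluster-id: my-flink-cluster
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: s3://my-bucket/flink-ha/
==========================================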

Cancelling the job and restarting it from the previous checkpoint is our regular
procedure for daily operations, so this roughly 10-minute TaskManager restart
cycle effectively slows down our throughput. For reference, the procedure we
follow is sketched below.
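
This is roughly the sequence we run; the job id, checkpoint path, and jar name
below are placeholders:
==========================================
# cancel the running job
flink cancel <jobId>

# resubmit the job, resuming from the latest retained checkpoint
flink run -s s3://my-bucket/checkpoints/<jobId>/chk-1234 my-etl-job.jar
==========================================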
I am trying to understand what leads the job into this situation, and hoping that
some configuration change will smooth things out. Any suggestion for shortening
the waiting would also be appreciated; it appears there are timeouts on the
TaskManager and JobManager side that could be adjusted to speed up the recovery.
But we would really like to avoid getting stuck in cancellation in the first
place. The settings we are aware of are sketched below.
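
These are the settings we are currently looking at; the values shown are the
documented defaults as we understand them, and we have not yet confirmed which
one actually drives the behavior. In particular, the 3-minute mark at which the
stuck task brings down the TaskManager seems to line up with the default
task.cancellation.timeout of 180000 ms.
==========================================
# interval between repeated cancellation attempts (default 30 s)
task.cancellation.interval: 30000
# time after which a stuck cancellation causes a fatal TaskManager error (default 180 s)
task.cancellation.timeout: 180000
# heartbeat timeout between JobManager and TaskManagers (default 50 s)
heartbeat.timeout: 50000
==========================================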

Thank you, hoping to get some insight here.

Ivan
