Hi Ivan,

My gut feeling is that this is related to FLINK-22535. Could @Yun Gao take another look? If that is the case, you can upgrade to 1.13.1 to fix it.
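For reference, the 3-minute window you observe before the TaskManagers are lost matches the default task-cancellation watchdog: if a task does not finish cancelling within task.cancellation.timeout, the TaskManager is failed with a fatal error. The relevant flink-conf.yaml entries are roughly the following (the values shown are, as far as I know, the 1.13 defaults, not a recommendation):

  # Interval between successive cancellation attempts for a task, in ms.
  task.cancellation.interval: 30000
  # Time after which a stuck cancellation becomes a fatal TaskManager error,
  # in ms. A value of 0 disables the watchdog.
  task.cancellation.timeout: 180000
  # Time without a heartbeat after which the JobManager considers a
  # TaskManager lost, in ms.
  heartbeat.timeout: 50000

Raising task.cancellation.timeout would only hide the symptom, though; if this is indeed FLINK-22535, the 1.13.1 upgrade addresses the hang itself.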
Best,
Yangze Guo

On Tue, Jul 27, 2021 at 9:41 AM Ivan Yang <ivanygy...@gmail.com> wrote:
>
> Dear Flink experts,
>
> We recently ran into an issue during job cancellation after upgrading to
> 1.13. After we issue a cancel (from the Flink console or flink cancel {jobid}),
> a few subtasks get stuck in the cancelling state. Once it gets into that
> situation, the behavior is consistent: those "cancelling" tasks never become
> canceled. After 3 minutes the job stopped and, as a result, a number of task
> managers were lost. It takes about another 5 minutes for those lost task
> managers to rejoin the JobManager; then we can restart the job from the
> previous checkpoint. We found this stack trace on the hanging (cancelling)
> TaskManager:
> ==========================================
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
> java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
> org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:705)
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.cleanUpInvoke(SourceStreamTask.java:186)
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:637)
> org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:776)
> org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
> java.lang.Thread.run(Thread.java:748)
> ===========================================
>
> Here is some background information about our job and setup.
> 1) The job is relatively large: we have 500+ parallelism and 2000+ subtasks.
> It mainly reads from a Kinesis stream, performs some transformations, and
> fans out to multiple output S3 buckets. It's a stateless ETL job.
> 2) The same code and setup running on smaller environments doesn't seem to
> have this cancel failure problem.
> 3) We had been running the same job on Flink 1.11.0 and never saw this
> cancel failure and TaskManager loss.
> 4) With the upgrade to 1.13, we also added Kubernetes HA (ZooKeeper-less).
> Previously we did not use HA.
>
> Cancel and restart from the previous checkpoint is our regular procedure for
> daily operation, so this 10-minute TaskManager restart cycle basically slows
> down our throughput. I am trying to understand what leads to this situation,
> hoping that a configuration change will smooth things out. Any suggestion to
> shorten the waiting would also help; it appears that some timeouts on the
> TaskManager and JobManager could be adjusted to speed it up. But we really
> want to avoid getting stuck in cancellation if we can.
>
> Thank you, hoping to get some insight here.
>
> Ivan
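One more thing: since the Kubernetes HA services were enabled in the same upgrade, it may be worth double-checking that part of the configuration as well. A minimal ZooKeeper-less HA setup for 1.13 looks roughly like the following sketch (the cluster id and the storage directory are placeholders for your own values):

  # Cluster id used to scope the HA ConfigMaps (placeholder value).
  kubernetes.cluster-id: my-flink-cluster
  # Enable the Kubernetes-based HA services (no ZooKeeper needed).
  high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
  # Durable storage for job metadata; the bucket path is a placeholder.
  high-availability.storageDir: s3://my-bucket/flink/recovery

That said, the cancellation hang itself still points more at FLINK-22535 than at the HA setup.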