From the stack below, it indicates there are no available buffers for source 
outputs including watermark and normal records, so the source will be blocked 
on request buffer from LocalBufferPool.
The checkpoint process is also affected by above blocking request. The root 
cause is why the queued output buffers are not consumed by downstream tasks.
 I think you can check the downstream task which inqueue usage should reach 
100%, then jstack the corresponding downstream tasks that may stuck in some 
operations to cause back pressure.

Best,
Zhijiang
------------------------------------------------------------------
发件人:Yan Zhou [FDS Science] <yz...@coupang.com>
发送时间:2018年10月23日(星期二) 02:29
收件人:user@flink.apache.org <user@flink.apache.org>
主 题:Re: checkpoint/taskmanager is stuck, deadlock on LocalBufferPool


 I am using flink 1.5.3 
From: Yan Zhou [FDS Science] <yz...@coupang.com>
Sent: Monday, October 22, 2018 11:26
To: user@flink.apache.org
Subject: checkpoint/taskmanager is stuck, deadlock on LocalBufferPool
 Hi,
 My application suddenly stuck and completely doesn't move forward after 
running for a few days. No exceptions are found. From the thread dump, I can 
see that the operator threads and checkpoint threads deadlock on 
LocalBufferPool.  LocalBufferPool is not able to request memory and keep the 
lock. Please see the thread dump at the bottom. 

It uses rocksdb as statebackend. From the heap dump and web ui, there are 
plenty of memory in jvm and it doesn't have GC problem. Check points were good 
until there was the problem:

2018-10-19 04:41:23,691 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering 
checkpoint 4347 @ 1539891683667 for job 1.
2018-10-19 04:41:45,069 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed 
checkpoint 4347 for job 1 (1019729450 bytes in 13600 ms).
2018-10-19 04:46:45,089 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering 
checkpoint 4348 @ 1539892005069 for job 1.
2018-10-19 04:56:45,089 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Checkpoint 4348 
of job 1 expired before completing.


This happened at mid night and the traffic was relatively low. Even if there 
was a spike and caused a back pressure, to my understand that the events should 
be processed  eventually and the network buffer would be available after that. 
What might be the cause of it? 


 Best
 Yan

"Time Trigger for Source: Custom Source -> Flat Map -> Flat Map -> 
Timestamps/Watermarks -> (from: ...s#363 daemon prio=5 os_prio=0 
tid=0x00007ff187944000 nid=0x8f76 in Object.wait() [0x00007ff12fda9000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegment(LocalBufferPool.java:247)
- locked <0x00000006dadeeac8> (a java.util.ArrayDeque)
at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:204)
at 
org.apache.flink.runtime.io.network.api.writer.RecordWriter.requestNewBufferBuilder(RecordWriter.java:213)
at 
org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:144)
at 
org.apache.flink.runtime.io.network.api.writer.RecordWriter.broadcastEmit(RecordWriter.java:117)
at 
org.apache.flink.streaming.runtime.io.StreamRecordWriter.broadcastEmit(StreamRecordWriter.java:87)
at 
org.apache.flink.streaming.runtime.io.RecordWriterOutput.emitWatermark(RecordWriterOutput.java:121)
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitWatermark(AbstractStreamOperator.java:668)
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:736)
at 
org.apache.flink.streaming.api.operators.ProcessOperator.processWatermark(ProcessOperator.java:72)
at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitWatermark(OperatorChain.java:479)
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitWatermark(AbstractStreamOperator.java:668)
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:736)
at 
org.apache.flink.streaming.api.operators.ProcessOperator.processWatermark(ProcessOperator.java:72)
at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitWatermark(OperatorChain.java:479)
at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$BroadcastingOutputCollector.emitWatermark(OperatorChain.java:603)
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitWatermark(AbstractStreamOperator.java:668)
at 
org.apache.flink.streaming.runtime.operators.TimestampsAndPeriodicWatermarksOperator.onProcessingTime(TimestampsAndPeriodicWatermarksOperator.java:77)
at 
org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService$TriggerTask.run(SystemProcessingTimeService.java:285)
- locked <0x00000006dcb3db50> (a java.lang.Object)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
- <0x00000006eac38310> (a java.util.concurrent.ThreadPoolExecutor$Worker)


"Source: Custom Source -> Flat Map -> Flat Map -> Timestamps/Watermarks -> 
(from: ...s#116 prio=5 os_prio=0 tid=0x00007ff1a791a000 nid=0x78c9 waiting for 
monitor entry [0x00007ff14f6da000]
  java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecordWithTimestamp(AbstractFetcher.java:397)
- waiting to lock <0x00000006dcb3db50> (a java.lang.Object)
at 
org.apache.flink.streaming.connectors.kafka.internal.Kafka010Fetcher.emitRecord(Kafka010Fetcher.java:89)
at 
org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.runFetchLoop(Kafka09Fetcher.java:154)
at 
org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:738)
at 
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
at 
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:56)
at 
org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:99)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:306)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:712)
at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
- None


"Async calls on Source: Custom Source -> Flat Map -> Flat Map -> 
Timestamps/Watermarks -> (from: (... sel#409 daemon prio=5 os_prio=0 
tid=0x00007ff1682fc800 nid=0x943c waiting for monitor entry [0x00007ff0dae3a000]
  java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:620)
- waiting to lock <0x00000006dcb3db50> (a java.lang.Object)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:564)
at 
org.apache.flink.streaming.runtime.tasks.SourceStreamTask.triggerCheckpoint(SourceStreamTask.java:116)
at org.apache.flink.runtime.taskmanager.Task$1.run(Task.java:1229)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
- <0x00000006ee7095f0> (a java.util.concurrent.ThreadPoolExecutor$Worker)

Reply via email to