I am using flink 1.5.3
________________________________
From: Yan Zhou [FDS Science] <yz...@coupang.com>
Sent: Monday, October 22, 2018 11:26
To: user@flink.apache.org
Subject: checkpoint/taskmanager is stuck, deadlock on LocalBufferPool

Hi,
My application suddenly stuck and completely doesn't move forward after running 
for a few days. No exceptions are found. From the thread dump, I can see that 
the operator threads and checkpoint threads deadlock on LocalBufferPool. 
LocalBufferPool is not able to request memory and keep the lock. Please see the 
thread dump at the bottom.

It uses rocksdb as statebackend. From the heap dump and web ui, there are 
plenty of memory in jvm and it doesn't have GC problem. Check points were good 
until there was the problem:

2018-10-19 04:41:23,691 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering 
checkpoint 4347 @ 1539891683667 for job 1.
2018-10-19 04:41:45,069 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed 
checkpoint 4347 for job 1 (1019729450 bytes in 13600 ms).
2018-10-19 04:46:45,089 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering 
checkpoint 4348 @ 1539892005069 for job 1.
2018-10-19 04:56:45,089 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Checkpoint 4348 
of job 1 expired before completing.


This happened at mid night and the traffic was relatively low. Even if there 
was a spike and caused a back pressure, to my understand that the events should 
be processed  eventually and the network buffer would be available after that. 
What might be the cause of it?


Best
Yan

"Time Trigger for Source: Custom Source -> Flat Map -> Flat Map -> 
Timestamps/Watermarks -> (from: ...s#363 daemon prio=5 os_prio=0 
tid=0x00007ff187944000 nid=0x8f76 in Object.wait() [0x00007ff12fda9000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegment(LocalBufferPool.java:247)
- locked <0x00000006dadeeac8> (a java.util.ArrayDeque)
at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:204)
at 
org.apache.flink.runtime.io.network.api.writer.RecordWriter.requestNewBufferBuilder(RecordWriter.java:213)
at 
org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:144)
at 
org.apache.flink.runtime.io.network.api.writer.RecordWriter.broadcastEmit(RecordWriter.java:117)
at 
org.apache.flink.streaming.runtime.io.StreamRecordWriter.broadcastEmit(StreamRecordWriter.java:87)
at 
org.apache.flink.streaming.runtime.io.RecordWriterOutput.emitWatermark(RecordWriterOutput.java:121)
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitWatermark(AbstractStreamOperator.java:668)
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:736)
at 
org.apache.flink.streaming.api.operators.ProcessOperator.processWatermark(ProcessOperator.java:72)
at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitWatermark(OperatorChain.java:479)
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitWatermark(AbstractStreamOperator.java:668)
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:736)
at 
org.apache.flink.streaming.api.operators.ProcessOperator.processWatermark(ProcessOperator.java:72)
at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitWatermark(OperatorChain.java:479)
at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$BroadcastingOutputCollector.emitWatermark(OperatorChain.java:603)
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitWatermark(AbstractStreamOperator.java:668)
at 
org.apache.flink.streaming.runtime.operators.TimestampsAndPeriodicWatermarksOperator.onProcessingTime(TimestampsAndPeriodicWatermarksOperator.java:77)
at 
org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService$TriggerTask.run(SystemProcessingTimeService.java:285)
- locked <0x00000006dcb3db50> (a java.lang.Object)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
- <0x00000006eac38310> (a java.util.concurrent.ThreadPoolExecutor$Worker)


"Source: Custom Source -> Flat Map -> Flat Map -> Timestamps/Watermarks -> 
(from: ...s#116 prio=5 os_prio=0 tid=0x00007ff1a791a000 nid=0x78c9 waiting for 
monitor entry [0x00007ff14f6da000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecordWithTimestamp(AbstractFetcher.java:397)
- waiting to lock <0x00000006dcb3db50> (a java.lang.Object)
at 
org.apache.flink.streaming.connectors.kafka.internal.Kafka010Fetcher.emitRecord(Kafka010Fetcher.java:89)
at 
org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.runFetchLoop(Kafka09Fetcher.java:154)
at 
org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:738)
at 
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
at 
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:56)
at 
org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:99)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:306)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:712)
at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
- None


"Async calls on Source: Custom Source -> Flat Map -> Flat Map -> 
Timestamps/Watermarks -> (from: (... sel#409 daemon prio=5 os_prio=0 
tid=0x00007ff1682fc800 nid=0x943c waiting for monitor entry [0x00007ff0dae3a000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:620)
- waiting to lock <0x00000006dcb3db50> (a java.lang.Object)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:564)
at 
org.apache.flink.streaming.runtime.tasks.SourceStreamTask.triggerCheckpoint(SourceStreamTask.java:116)
at org.apache.flink.runtime.taskmanager.Task$1.run(Task.java:1229)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
- <0x00000006ee7095f0> (a java.util.concurrent.ThreadPoolExecutor$Worker)

Reply via email to