We are also seeing something very similar. Looks like a bug. It seems to get stuck in LocalBufferPool forever and the job has to be restarted.
Is anyone else facing this too?

On Tue, Apr 9, 2019 at 9:04 PM Indraneel R <vascodaga...@gmail.com> wrote:

> Hi,
>
> We are trying to run a very simple Flink pipeline, which is used to
> sessionize events from a Kinesis stream. It is an
> - event-time window with a 30 min gap,
> - trigger interval of 15 mins, and
> - late arrival duration of 10 hrs.
> This is how the graph looks:
>
> [image: Screenshot 2019-04-10 at 12.08.25 AM.png]
>
> But what we are observing is that after 2-3 days of continuous running
> the job becomes progressively unstable and completely freezes.
>
> A thread dump analysis revealed that it is waiting indefinitely at
> `LocalBufferPool.requestMemorySegment(LocalBufferPool.java:261)`
> for a memory segment to become available. While waiting it holds the
> checkpoint lock, and therefore blocks all other threads as well, since
> they are all requesting a lock on the `checkpointLock` object.
>
> We are not able to figure out why it cannot get any segment, because
> there is no indication of backpressure, at least in the Flink UI.
> Here are our job configurations:
>
> number of TaskManagers: 4
> jobmanager.heap.size: 8000m
> taskmanager.heap.size: 11000m
> taskmanager.numberOfTaskSlots: 4
> parallelism.default: 16
> taskmanager.network.memory.max: 5gb
> taskmanager.network.memory.min: 3gb
> taskmanager.network.memory.buffers-per-channel: 8
> taskmanager.network.memory.floating-buffers-per-gate: 16
> taskmanager.memory.size: 13gb
>
> data rate: 250 messages/sec, or 1 MB/sec
>
> Any ideas on what could be the issue?
>
> regards
> -Indraneel
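For anyone trying to reproduce this, here is a rough sketch of the kind of pipeline described above. Only the window gap (30 min), the continuous event-time trigger (15 min) and the allowed lateness (10 hrs) follow the original description; the stream name, region, key extraction, timestamp parsing and the reduce step are placeholders, not the original job's code.

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.triggers.ContinuousEventTimeTrigger;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants;

import java.util.Properties;

public class KinesisSessionizeJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Kinesis source -- stream name and region are placeholders
        Properties consumerConfig = new Properties();
        consumerConfig.setProperty(AWSConfigConstants.AWS_REGION, "us-east-1");
        DataStream<String> events = env.addSource(
                new FlinkKinesisConsumer<>("events-stream", new SimpleStringSchema(), consumerConfig));

        events
                // placeholder timestamp extraction with 1 min of out-of-orderness
                .assignTimestampsAndWatermarks(
                        new BoundedOutOfOrdernessTimestampExtractor<String>(Time.minutes(1)) {
                            @Override
                            public long extractTimestamp(String event) {
                                return Long.parseLong(event.split(",")[1]);
                            }
                        })
                // placeholder session key extraction (e.g. a user id field)
                .keyBy(new KeySelector<String, String>() {
                    @Override
                    public String getKey(String event) {
                        return event.split(",")[0];
                    }
                })
                // event-time session window with a 30-minute gap
                .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
                // fire every 15 minutes of event time while the session is open
                .trigger(ContinuousEventTimeTrigger.of(Time.minutes(15)))
                // keep window state around for events arriving up to 10 hours late
                .allowedLateness(Time.hours(10))
                // placeholder aggregation: concatenate the events of a session
                .reduce((a, b) -> a + "," + b)
                .print();

        env.execute("kinesis-sessionization");
    }
}
```

Note that the 10-hour allowed lateness means every session's state is retained for at least that long after the watermark passes its end, which, combined with the 15-minute continuous trigger, keeps a lot of windows live at once.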