Hi,

Could you jstack the downstream task (the Window) and have a look at what the window operator is doing?

Best,
Guowei
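For reference, a thread dump of the TaskManager JVM that runs the window task can be taken with the JDK's jstack tool; the PID, output file name, and grep pattern below are placeholders. Flink task threads are typically named after the task, which helps locate the window operator's frames in the dump:

    # placeholder PID of the TaskManager JVM that hosts the window task
    jstack 12345 > taskmanager-threads.txt

    # task threads usually carry the task name, so search for the window task
    grep -A 40 "Window" taskmanager-threads.txt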
Rahul Jain <rahul...@gmail.com> wrote on Wed, Apr 10, 2019 at 1:04 PM:

> We are also seeing something very similar. Looks like a bug.
>
> It seems to get stuck in LocalBufferPool forever and the job has to be
> restarted.
>
> Is anyone else facing this too?
>
> On Tue, Apr 9, 2019 at 9:04 PM Indraneel R <vascodaga...@gmail.com> wrote:
>
>> Hi,
>>
>> We are trying to run a very simple Flink pipeline, which is used to
>> sessionize events from a Kinesis stream. It is an
>> - event time window with a 30 min gap,
>> - trigger interval of 15 min, and
>> - late arrival duration of 10 hrs.
>> This is how the graph looks:
>>
>> [image: Screenshot 2019-04-10 at 12.08.25 AM.png]
>>
>> But what we are observing is that after 2-3 days of continuous running
>> the job becomes progressively unstable and completely freezes.
>>
>> Thread dump analysis revealed that it is indefinitely waiting at
>> `LocalBufferPool.requestMemorySegment(LocalBufferPool.java:261)`
>> for a memory segment to become available.
>> While it is waiting it holds the checkpoint lock, and therefore blocks
>> all other threads as well, since they are all requesting a lock on the
>> `checkpointLock` object.
>>
>> But we are not able to figure out why it is not able to get any segment,
>> because there is no indication of backpressure, at least on the Flink UI.
>> Here are our job configurations:
>>
>> *number of TaskManagers: 4*
>> *jobmanager.heap.size: 8000m*
>> *taskmanager.heap.size: 11000m*
>> *taskmanager.numberOfTaskSlots: 4*
>> *parallelism.default: 16*
>> *taskmanager.network.memory.max: 5gb*
>> *taskmanager.network.memory.min: 3gb*
>> *taskmanager.network.memory.buffers-per-channel: 8*
>> *taskmanager.network.memory.floating-buffers-per-gate: 16*
>> *taskmanager.memory.size: 13gb*
>>
>> *data rate: 250 messages/sec*
>> *or 1 MB/sec*
>>
>> Any ideas on what could be the issue?
>>
>> regards
>> -Indraneel
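For readers trying to reproduce the setup: a minimal sketch of a pipeline like the one described above (Kinesis source, event-time session windows with a 30 min gap, a 15 min continuous trigger, and 10 h allowed lateness). This is not the original job's code; the stream name, region, key extraction, timestamp parsing, and aggregation are hypothetical placeholders.

    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.streaming.api.TimeCharacteristic;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
    import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.streaming.api.windowing.triggers.ContinuousEventTimeTrigger;
    import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;

    public class SessionizeJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

            // Placeholder Kinesis settings; credentials come from the default provider chain.
            Properties kinesisProps = new Properties();
            kinesisProps.setProperty("aws.region", "us-east-1");

            env.addSource(new FlinkKinesisConsumer<>(
                    "events-stream", new SimpleStringSchema(), kinesisProps))
                // Placeholder watermarking: 1 min out-of-orderness, timestamp parsed from the record.
                .assignTimestampsAndWatermarks(
                    new BoundedOutOfOrdernessTimestampExtractor<String>(Time.minutes(1)) {
                        @Override
                        public long extractTimestamp(String record) {
                            return parseEventTime(record); // hypothetical timestamp extraction
                        }
                    })
                .keyBy(new KeySelector<String, String>() {
                    @Override
                    public String getKey(String record) {
                        return extractSessionKey(record); // hypothetical session key
                    }
                })
                .window(EventTimeSessionWindows.withGap(Time.minutes(30))) // 30 min session gap
                .trigger(ContinuousEventTimeTrigger.of(Time.minutes(15)))  // fire every 15 min
                .allowedLateness(Time.hours(10))                           // accept 10 h of late data
                .reduce((a, b) -> a + "," + b)                             // placeholder aggregation
                .print();

            env.execute("sessionize-kinesis-events");
        }

        private static long parseEventTime(String record) { return System.currentTimeMillis(); } // placeholder
        private static String extractSessionKey(String record) { return record; }                // placeholder
    }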