Hi Yufei,

My prime suspect would be changes to the memory configuration introduced in 1.11 [1].
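[Editor's note: for readers following the link, the memory rework referenced here introduced a new set of TaskManager memory options. The keys below are the actual 1.11 option names; the values are purely illustrative, not recommendations for this job.]

```yaml
# flink-conf.yaml (Flink 1.11+ memory model) -- illustrative values only
taskmanager.memory.process.size: 4096m    # total TaskManager process memory
taskmanager.memory.network.fraction: 0.1  # fraction of total Flink memory used for network buffers
taskmanager.memory.network.min: 64mb      # lower bound on network memory
taskmanager.memory.network.max: 1gb       # upper bound on network memory
```

Because the network buffer pool is now derived from these bounds rather than the pre-1.10 settings, an unchanged job can end up with a differently sized buffer pool after upgrading.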
Piotrek

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html#memory-management

On Mon, 28 Dec 2020 at 09:52, Till Rohrmann <trohrm...@apache.org> wrote:
> Hi Yufei,
>
> I cannot remember exactly the changes in this area between Flink 1.10.0 and
> Flink 1.12.0. It sounds a bit as if we were not releasing memory segments
> fast enough, or had a memory leak. One thing to try is to increase the
> restart delay to see whether it is the first problem. Alternatively, you
> can also try to bisect the commits between these versions. If you have a
> test that fails reliably, this shouldn't take too long. Maybe Piotr knows
> about a fix which could have solved this problem.
>
> Cheers,
> Till
>
> On Fri, Dec 25, 2020 at 3:05 AM Yangze Guo <karma...@gmail.com> wrote:
> >
> > Hi, Yufei.
> >
> > Can you reproduce this issue in 1.10.0? The deterministic slot sharing
> > introduced in 1.12.0 is one possible reason. Before 1.12.0, the
> > distribution of tasks across slots was not deterministic, so even if the
> > network buffers are sufficient from the perspective of the cluster, a bad
> > distribution of tasks can lead to the "insufficient network buffer"
> > error as well.
> >
> > Best,
> > Yangze Guo
> >
> > On Fri, Dec 25, 2020 at 12:54 AM Yufei Liu <liuyufei9...@gmail.com> wrote:
> > >
> > > Hey,
> > > I've found that the job throws "java.io.IOException: Insufficient
> > > number of network buffers: required 51, but only 1 available" after a
> > > job restart, and I've observed that the TM uses many more network
> > > buffers than before.
> > > My internal branch, based on 1.10.0, can easily reproduce this, but
> > > with 1.12.0 I don't see the issue. I think it may have already been
> > > fixed by some PR; I'm curious what can lead to this problem?
> > >
> > > Best,
> > > Yufei.
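[Editor's note: Till's suggestion to "increase the restart delay" maps to Flink's fixed-delay restart strategy. A minimal sketch of the relevant `flink-conf.yaml` keys follows; the key names are the real Flink options, the values are illustrative. A longer delay gives the TaskManagers more time to release memory segments from the failed attempt before tasks are redeployed, which helps distinguish a slow-release problem from a genuine leak.]

```yaml
# flink-conf.yaml -- fixed-delay restart strategy (values illustrative)
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3   # how many restarts before the job fails permanently
restart-strategy.fixed-delay.delay: 30 s   # pause between failure and redeployment
```

If the "insufficient network buffers" error disappears with a long delay but returns with a short one, that points at segments not being released fast enough rather than a leak.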