Hi Yufei,

My prime suspect would be the changes to the memory configuration
introduced in 1.11 [1].

Piotrek

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html#memory-management
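
If it is that, the network memory options are the first thing I would
check. As a minimal sketch, with the defaults from the docs (adjust the
values to your setup):

    taskmanager.memory.network.fraction: 0.1
    taskmanager.memory.network.min: 64mb
    taskmanager.memory.network.max: 1gb

Increasing the fraction or the max bound gives each TaskManager a
larger pool of network buffers.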

On Mon, Dec 28, 2020 at 09:52 Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Yufei,
>
> I cannot remember exactly which changes were made in this area between
> Flink 1.10.0 and Flink 1.12.0. It sounds a bit as if we were either not
> releasing memory segments fast enough or had a memory leak. One thing to
> try is to increase the restart delay to see whether it is the former
> problem. Alternatively, you can try to bisect the commits between these
> two versions; if you have a reliably failing test, this shouldn't take
> too long. Maybe Piotr knows about a fix which could have solved this
> problem.
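>
> For the restart delay experiment, a minimal sketch of the relevant
> config (the keys are documented; the values here are just an example):
>
>     restart-strategy: fixed-delay
>     restart-strategy.fixed-delay.attempts: 3
>     restart-strategy.fixed-delay.delay: 30 s
>
> If the error disappears with a larger delay, it points at segments not
> being released fast enough rather than at a leak.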
>
> Cheers,
> Till
>
> On Fri, Dec 25, 2020 at 3:05 AM Yangze Guo <karma...@gmail.com> wrote:
>
> > Hi, Yufei.
> >
> > Can you reproduce this issue in 1.10.0? The deterministic slot sharing
> > introduced in 1.12.0 is one possible reason. Before 1.12.0, the
> > distribution of tasks across slots was not deterministic, so even if
> > the network buffers were sufficient from the perspective of the whole
> > cluster, a bad distribution of tasks could still lead to the
> > "insufficient network buffer" error.
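> >
> > As a rough sketch of why the placement matters (using the documented
> > defaults of 2 exclusive buffers per channel and 8 floating buffers per
> > gate), each input gate needs about
> >
> >     #channels * 2 + 8
> >
> > buffers, so packing a few extra channel-heavy tasks into one TM can
> > exhaust its buffer pool even though the cluster as a whole has enough.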
> >
> > Best,
> > Yangze Guo
> >
> > On Fri, Dec 25, 2020 at 12:54 AM Yufei Liu <liuyufei9...@gmail.com>
> wrote:
> > >
> > > Hey,
> > > I’ve found that the job throws “java.io.IOException: Insufficient
> > > number of network buffers: required 51, but only 1 available” after a
> > > job restart, and I’ve observed the TM using many more network buffers
> > > than before.
> > > My internal branch is based on 1.10.0 and can easily reproduce this,
> > > but 1.12.0 doesn’t have this issue. I think it may have already been
> > > fixed by some PR; I’m curious about what could lead to this problem?
> > >
> > > Best.
> > > YuFei.
> >
>
