Re: sporadic "Insufficient no of network buffers" issue

2020-08-02 Thread Rahul Patwari
After debugging more, it seems that this issue is caused by the scheduling strategy. Depending on which tasks are assigned to a task manager, the memory configured for network buffers probably runs out. Through these references: FLINK-12122

Re: sporadic "Insufficient no of network buffers" issue

2020-08-01 Thread Rahul Patwari
From the metrics in Prometheus, we observed that the minimum AvailableMemorySegments across all the task managers was 4.5k when the exception was thrown, so there were enough network buffers. Correction to the configs provided above: each TM CPU has 8 cores. Apart from having fewer network buffers

Re: sporadic "Insufficient no of network buffers" issue

2020-07-31 Thread Ivan Yang
Yes, increase taskmanager.network.memory.fraction in your case. Also, reducing the parallelism will reduce the number of network buffers required for your job. I never used 1.4.x, so I don't know about it.

Ivan

> On Jul 31, 2020, at 11:37 PM, Rahul Patwari wrote:
>
> Thanks for your reply, Ivan.
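To illustrate why lower parallelism needs fewer network buffers, here is a rough sketch in Python. It is a simplified model, not Flink's exact accounting: it assumes one all-to-all shuffle, 2 exclusive buffers per input channel and 8 floating buffers per input gate (my understanding of the defaults for taskmanager.network.memory.buffers-per-channel and taskmanager.network.memory.floating-buffers-per-gate), and one buffer per outgoing subpartition. The function name and the example parallelism values are hypothetical.

```python
def buffers_per_shuffle(parallelism, slots_per_tm,
                        exclusive_per_channel=2, floating_per_gate=8):
    """Rough per-TaskManager buffer estimate for one all-to-all shuffle.

    Assumption: every slot on the TM runs one sending and one receiving
    subtask of the shuffle. Real jobs and Flink's own accounting differ.
    """
    # Receiving side: one input channel per upstream subtask,
    # plus the floating buffers shared by the input gate.
    receiving = parallelism * exclusive_per_channel + floating_per_gate
    # Sending side: at least one buffer per downstream subpartition.
    sending = parallelism
    return slots_per_tm * (receiving + sending)


# Hypothetical numbers: compare two parallelism settings on an 8-slot TM.
for p in (100, 50):
    print(p, buffers_per_shuffle(p, slots_per_tm=8))
```

Under this model the requirement grows linearly with parallelism for each shuffle, so halving the parallelism roughly halves the buffers a task manager needs.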

Re: sporadic "Insufficient no of network buffers" issue

2020-07-31 Thread Rahul Patwari
Thanks for your reply, Ivan. I think taskmanager.network.memory.max is 1GB by default. In my case, the network buffer memory is 13112 * 32768 bytes = around 400MB, which is 10% of the TM memory, since taskmanager.network.memory.fraction is 0.1 by default. Do you mean to increase taskmanager.network.memory.f
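The arithmetic above can be checked with a short Python sketch. It assumes the network memory is the configured fraction of the TM memory, clamped between taskmanager.network.memory.min and taskmanager.network.memory.max. The 64MB minimum is an assumption on my part; the 1GB maximum, the 0.1 fraction, and the 32768-byte segment size come from the message above. The exact memory base Flink applies the fraction to may differ, so the buffer count is only approximate.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def network_memory_bytes(total_bytes, fraction=0.1,
                         min_bytes=64 * MB, max_bytes=1 * GB):
    """Fraction of total memory, clamped to [min_bytes, max_bytes]."""
    return int(min(max_bytes, max(min_bytes, fraction * total_bytes)))


def buffer_count(network_bytes, segment_size=32768):
    """Number of whole network buffers (memory segments) that fit."""
    return network_bytes // segment_size


# Figure reported in the thread: 13112 buffers of 32768 bytes each.
observed_bytes = 13112 * 32768
print(observed_bytes / MB)   # size in MiB, i.e. "around 400MB"

# Estimate for a 4GB TM with the default fraction of 0.1.
print(buffer_count(network_memory_bytes(4 * GB)))
```

Raising the fraction only helps until the result reaches taskmanager.network.memory.max, after which the maximum has to be raised as well.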

Re: sporadic "Insufficient no of network buffers" issue

2020-07-31 Thread Ivan Yang
Hi Rahul, Try increasing taskmanager.network.memory.max to 1GB, basically double what you have now. However, you only have 4GB RAM for the entire TM, and a 1GB network buffer seems out of proportion with 4GB total RAM. Reducing the amount of shuffling will require fewer network buffers. But if you

sporadic "Insufficient no of network buffers" issue

2020-07-31 Thread Rahul Patwari
Hi, We are observing the "Insufficient number of Network Buffers" issue sporadically after upgrading Flink from 1.4.2 to 1.8.2. The state of the tasks with this issue transitioned from DEPLOYING to FAILED. Whenever this issue occurs, the job manager restarts. Sometimes, the issue goes away after the re