PS: Also pulling in Nico (CC'd), who is working on the network stack.
On Thu, Aug 17, 2017 at 11:23 AM, Ufuk Celebi <u...@apache.org> wrote:
> Hey Gwenhael,
>
> the network buffers are recycled automatically after a job terminates.
> If this does not happen, it would be quite a major bug.
>
> To help debug this:
>
> - Which version of Flink are you using?
> - Does the job fail immediately after submission or later during execution?
> - Is the following correct: the batch job that eventually fails
> because of missing network buffers runs without problems if you submit
> it to a fresh cluster with the same memory configuration?
>
> The network buffers are recycled after the task managers report their
> tasks as finished. If you immediately submit the next batch, there is
> a slight chance that the buffers have not been recycled yet. As a
> possible temporary workaround, could you try waiting for a short
> amount of time before submitting the next batch?
>
> I think we should also be able to run the job without splitting it up
> after increasing the network memory configuration. Did you already try
> this?
>
> Best,
>
> Ufuk
>
>
> On Thu, Aug 17, 2017 at 10:38 AM, Gwenhael Pasquiers
> <gwenhael.pasqui...@ericsson.com> wrote:
>> Hello,
>>
>> We’re hitting a limit with the numberOfBuffers.
>>
>> In a quite complex job we do a lot of operations, with a lot of
>> operators, on a lot of folders (datehours).
>>
>> In order to split the job into smaller “batches” (to limit the
>> necessary “numberOfBuffers”) I’ve added a loop over the batches
>> (handling the datehours 3 by 3); for each batch I create a new env
>> and then call the execute() method.
>>
>> However it looks like there is no cleanup: after a while, if the
>> number of batches is too big, there is an error saying that the
>> numberOfBuffers isn’t high enough. It kind of looks like a leak.
>> Is there a way to clean them up?
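To make the waiting workaround concrete, here is a minimal sketch of the
batched submission loop with a pause between executions. It is a sketch of
what Gwenhael described, not his actual code: partitionDatehoursByThree()
and buildBatchJob() are hypothetical placeholders for the application
logic, and the 10-second pause is a guess, not a tested value.

import org.apache.flink.api.java.ExecutionEnvironment;

import java.util.List;

public class BatchedSubmission {

    public static void main(String[] args) throws Exception {
        // Hypothetical helper: groups the datehour folders 3 by 3.
        List<List<String>> batches = partitionDatehoursByThree();

        for (List<String> batch : batches) {
            // Fresh environment per batch, as in the original loop.
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            // Hypothetical: adds the sources/operators/sinks for 3 datehours.
            buildBatchJob(env, batch);
            env.execute("batch " + batch);

            // Workaround: give the TaskManagers a moment to report their
            // tasks as FINISHED so the network buffers are recycled before
            // the next submission.
            Thread.sleep(10_000L);
        }
    }

    // Stubs so the sketch is self-contained; application-specific in reality.
    private static List<List<String>> partitionDatehoursByThree() {
        throw new UnsupportedOperationException("application-specific");
    }

    private static void buildBatchJob(ExecutionEnvironment env, List<String> batch) {
        throw new UnsupportedOperationException("application-specific");
    }
}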
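For the second suggestion, the network memory is set in flink-conf.yaml.
A sketch, assuming a 1.3.x-era setup; the exact keys depend on the Flink
version, and the values below are illustrative, not recommendations:

# flink-conf.yaml -- illustrative values only
# Legacy key: total number of network buffers per TaskManager
# (each buffer is one memory segment, 32 KB by default).
taskmanager.network.numberOfBuffers: 8192
# 1.3+ alternative: size the network memory as a fraction of the
# TaskManager memory, bounded by min/max (values in bytes).
taskmanager.network.memory.fraction: 0.15
taskmanager.network.memory.min: 67108864
taskmanager.network.memory.max: 1073741824

If the job really does exhaust buffers only across repeated submissions
(and not on a fresh cluster), please report back, because that would point
at the recycling bug Ufuk mentions rather than at the configuration.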