Hey Gwenhael,

The network buffers are recycled automatically after a job terminates. If this does not happen, it would be quite a major bug.
To help debug this:

- Which version of Flink are you using?
- Does the job fail immediately after submission, or later during execution?
- Is the following correct: the batch job that eventually fails because of missing network buffers runs without problems if you submit it to a fresh cluster with the same memory configuration?

The network buffers are recycled after the task managers report the task as finished. If you immediately submit the next batch, there is a slight chance that the buffers have not been recycled yet. As a possible temporary workaround, could you try waiting for a short amount of time before submitting the next batch?

I think we should also be able to run the job without splitting it up after increasing the network memory configuration. Did you already try this? A rough sketch of both ideas is at the bottom of this mail, below your quoted message.

Best,

Ufuk

On Thu, Aug 17, 2017 at 10:38 AM, Gwenhael Pasquiers
<gwenhael.pasqui...@ericsson.com> wrote:
> Hello,
>
> We're hitting a limit with the numberOfBuffers.
>
> In a quite complex job we do a lot of operations, with a lot of operators,
> on a lot of folders (datehours).
>
> In order to split the job into smaller "batches" (to limit the necessary
> "numberOfBuffers") I've done a loop over the batches (handling the datehours
> 3 by 3); for each batch I create a new env and then call the execute() method.
>
> However it looks like there is no cleanup: after a while, if the number of
> batches is too big, there is an error saying that the numberOfBuffers isn't
> high enough. It kind of looks like some leak. Is there a way to clean them
> up?
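For illustration, here is a minimal sketch of the submission loop you describe, assuming the DataSet API; the paths, datehour values and the 10-second pause are made up and only show the idea. The Thread.sleep between execute() calls is the temporary workaround mentioned above, giving the task managers time to report the finished tasks so their network buffers are recycled before the next submission.

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    import java.util.Arrays;
    import java.util.List;

    public class BatchedSubmission {

        public static void main(String[] args) throws Exception {
            // Hypothetical datehour folders, processed 3 per submission to keep
            // the per-job numberOfBuffers requirement small.
            List<List<String>> batches = Arrays.asList(
                    Arrays.asList("2017081700", "2017081701", "2017081702"),
                    Arrays.asList("2017081703", "2017081704", "2017081705"));

            for (List<String> batch : batches) {
                // Fresh environment per batch, as in the original mail.
                ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

                for (String dateHour : batch) {
                    DataSet<String> lines = env.readTextFile("hdfs:///data/" + dateHour);
                    lines.filter(line -> !line.isEmpty())
                         .writeAsText("hdfs:///out/" + dateHour);
                }

                env.execute("batch " + batch);

                // Temporary workaround: wait a bit so the finished tasks'
                // network buffers are recycled before the next submission.
                Thread.sleep(10_000L);
            }
        }
    }

As for the alternative without splitting: increasing the network buffer memory in flink-conf.yaml (taskmanager.network.numberOfBuffers in older versions, or the taskmanager.network.memory.* options in more recent ones; the exact key depends on your Flink version) should let the whole job run in a single submission.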