Hi Ilan,

I think so. Using the CLI instead of the REST API should solve this, as the user code execution would be moved out to a separate JVM. If you try that, it would be great to hear back whether it solved your issue.

As for 1.13.4, there is currently no ongoing effort or concrete plan for the release.

Best,
D.

On Sun, Dec 5, 2021 at 4:06 PM Ilan Huchansky <ilan.huchan...@start.io> wrote:

> Hi David,
>
> Thanks for your fast response.
>
> Do you think that changing the submission method could solve the problem?
> Using the CLI instead of the REST API.
>
> Another question: I see that the most critical issue (FLINK-25022) is in
> progress and should be released with version 1.13.4. Do you know when
> this version is planned to be released?
>
> Thanks again,
> Ilan.
>
> *From:* David Morávek <d...@apache.org>
> *Date:* Thursday, 2 December 2021 at 17:25
> *To:* Ilan Huchansky <ilan.huchan...@start.io>
> *Cc:* user@flink.apache.org <user@flink.apache.org>, Start.io SDP <s...@start.io>
> *Subject:* Re: Unable to create new native thread error
>
> Hi Ilan,
>
> we are aware of multiple issues where web-submission can result in
> classloader / thread local leaks, which could potentially result in the
> behavior you're describing. We're working on addressing them.
>
> FLINK-25022 [1]: The most critical one, leaking thread locals.
> FLINK-25027 [2]: Is only a memory improvement for a particular situation
> (a lot of small batch jobs) and could be worked around by accounting for it
> when setting the Metaspace size.
> FLINK-25023 [3]: Can leak the classloader of the first job submitted via
> the REST API (constant overhead for Metaspace).
>
> In general, web-submission is different from a normal submission in that
> the "main method" of the uploaded jar is executed on the JobManager, and
> it's really hard to isolate its execution from possible side effects.
>
> [1] https://issues.apache.org/jira/browse/FLINK-25022
> [2] https://issues.apache.org/jira/browse/FLINK-25027
> [3] https://issues.apache.org/jira/browse/FLINK-25023
>
> Best,
> D.
>
> On Thu, Dec 2, 2021 at 3:51 PM Ilan Huchansky <ilan.huchan...@start.io> wrote:
>
> Hi Flink mailing list,
>
> I am Ilan from the Start.io data platform team, and I need some guidance.
>
> We have a flow with the following use case:
>
> - We read files from AWS S3 buckets, process them on our cluster, and
>   sink the data into files using the Flink file sink.
> - The jobs always use the same jar; we uploaded it to every job
>   manager on the cluster.
> - We are submitting jobs constantly through the REST API.
> - Each job reads one or more files from S3.
> - The jobs can run from 20 seconds up to 3.5 hours.
> - The jobs run in batch mode.
> - We are running Flink 1.13.1.
> - We are running in cluster mode using Docker; the same machines are
>   used for the task and job managers.
>
> We are struggling with the same error, over and over again. We encounter
> it in the job manager and in the task manager.
>
> After the cluster has been running for a while and jobs are finishing
> correctly, the task and job managers fail to operate due to:
>
> Caused by: java.lang.OutOfMemoryError: unable to create new native thread.
>
> We also see some sporadic failures of java.lang.NoClassDefFoundError; we
> are not sure whether it is related.
>
> Our setup and configuration are as follows:
>
> · 5-node cluster running on Docker
>
> · Relevant memory config:
>
>   jobmanager.memory.heap.size: 1600m
>   taskmanager.memory.process.size: 231664m
>   taskmanager.memory.network.fraction: 0.3
>   taskmanager.memory.jvm-metaspace.size: 10g
>   jobmanager.memory.jvm-metaspace.size: 2g
>   taskmanager.memory.framework.off-heap.size: 1g
>
> · Host details:
>
>   max locked memory (kbytes, -l) 65536
>   max memory size (kbytes, -m) unlimited
>   open files (-n) 1024
>   max user processes (-u) 1547269
>   virtual memory (kbytes, -v) unlimited
>   file locks (-x) unlimited
>
>   cat /proc/sys/kernel/threads-max: 3094538
>   kernel.pid_max = 57344
>
> We tried increasing the max user processes, and also increasing and
> decreasing the jvm-metaspace size.
>
> Should we keep increasing the max number of processes on the host? Is
> there a way to limit the number of threads from the Flink config?
>
> What should we do? Any insights?
> I can provide more information as needed.
>
> Thanks in advance,
>
> Ilan
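
One way to read the host details quoted above, offered as a sketch and not a diagnosis: on Linux every thread takes an ID from the same pool as process IDs, so kernel.pid_max bounds the total number of threads on the host no matter how high "max user processes" or threads-max are set. The snippet below compares the three values reported in the original message. The function names are hypothetical, only the three numbers come from the thread, and any container-level limit (for example a cgroup pids limit under Docker) is not covered because the thread does not report one.

```python
from pathlib import Path

# Values copied from the "Host details" section of the original message.
REPORTED_LIMITS = {
    "kernel.threads-max": 3094538,
    "max user processes (ulimit -u)": 1547269,
    "kernel.pid_max": 57344,
}


def tightest_limit(limits):
    """Return (name, value) of the smallest limit in the mapping."""
    name = min(limits, key=limits.get)
    return name, limits[name]


def read_proc_int(path):
    """Read an integer from a /proc file; return None if it is unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    name, value = tightest_limit(REPORTED_LIMITS)
    print(f"tightest reported limit: {name} = {value}")

    # Optional: show the live values on the machine running this script.
    for proc_file in ("/proc/sys/kernel/pid_max", "/proc/sys/kernel/threads-max"):
        print(proc_file, "=", read_proc_int(proc_file))
```

With the reported numbers, kernel.pid_max (57344) is the smallest of the three, so it may be worth checking that value before raising "max user processes" any further.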