Hi Ilan, we are aware of multiple issues where web submission can result in classloader / thread-local leaks, which could potentially result in the behavior you're describing. We're working on addressing them.
FLINK-25022 [1]: The most critical one, leaking thread locals.
FLINK-25027 [2]: Only a memory improvement for a particular situation (a lot
of small batch jobs); it can be addressed by accounting for it when setting
the Metaspace size.
FLINK-25023 [3]: Can leak the classloader of the first job submitted via the
REST API (a constant overhead for Metaspace).

In general, web submission differs from a normal submission in that the
"main method" of the uploaded jar is executed on the JobManager, and it is
really hard to isolate its execution from possible side effects (see the
sketch at the end of this message for the general thread-local leak pattern).

[1] https://issues.apache.org/jira/browse/FLINK-25022
[2] https://issues.apache.org/jira/browse/FLINK-25027
[3] https://issues.apache.org/jira/browse/FLINK-25023

Best,
D.

On Thu, Dec 2, 2021 at 3:51 PM Ilan Huchansky <ilan.huchan...@start.io> wrote:
> *Hi Flink mailing list,*
>
> I am Ilan from the Start.io data platform team, and I need some guidance.
>
> We have a flow with the following use case:
>
> - We read files from AWS S3 buckets, process them on our cluster, and
>   sink the data into files using Flink's file sink.
> - The jobs always use the same jar; we uploaded it to every job manager
>   on the cluster.
> - We are submitting jobs constantly through the REST API.
> - Each job reads one or more files from S3.
> - The jobs can run from 20 seconds up to 3.5 hours.
> - The jobs run in batch mode.
> - We are running Flink 1.13.1.
> - We run in cluster mode using Docker; the same machines are used for
>   task managers and job managers.
>
> We are struggling with the same error over and over again. We encounter
> it on both the job manager and the task manager.
>
> After the cluster has been running for a while and jobs have been
> finishing correctly, the task and job manager fail to operate due to:
>
> Caused by: java.lang.OutOfMemoryError: unable to create new native thread
>
> We also see some sporadic java.lang.NoClassDefFoundError failures; we are
> not sure whether they are related.
>
> Our setup and configuration are as follows:
>
> · 5-node cluster running on Docker
> · Relevant memory config:
>
>   jobmanager.memory.heap.size: 1600m
>   taskmanager.memory.process.size: 231664m
>   taskmanager.memory.network.fraction: 0.3
>   taskmanager.memory.jvm-metaspace.size: 10g
>   jobmanager.memory.jvm-metaspace.size: 2g
>   taskmanager.memory.framework.off-heap.size: 1g
>
> · Host details:
>
>   max locked memory (kbytes, -l) 65536
>   max memory size (kbytes, -m) unlimited
>   open files (-n) 1024
>   max user processes (-u) 1547269
>   virtual memory (kbytes, -v) unlimited
>   file locks (-x) unlimited
>
>   cat /proc/sys/kernel/threads-max: 3094538
>   kernel.pid_max = 57344
>
> We tried increasing the max user processes, and also increasing and
> decreasing the JVM Metaspace.
>
> Should we keep increasing the max number of processes on the host? Is
> there a way to limit the number of threads from the Flink config?
>
> What should we do? Any insights?
> I can provide more information as needed.
>
> Thanks in advance
>
> Ilan
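
As a rough illustration of the thread-local leak behind FLINK-25022 (the
sketch mentioned above): this is not Flink code, just a minimal standalone
example, and the names (ThreadLocalLeakSketch, CACHE, sharedPool) are
invented. It shows the general pattern by which a value left in a
ThreadLocal of a long-lived, reused thread stays strongly reachable after
the task that set it has finished; if that value is an instance of a class
from an uploaded jar, the per-job user-code classloader stays reachable
(and its Metaspace is never freed) along with it.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadLocalLeakSketch {

    // Stands in for a ThreadLocal that user code (or a library it pulls in)
    // populates while the uploaded jar's main() runs.
    private static final ThreadLocal<byte[]> CACHE = new ThreadLocal<>();

    public static void main(String[] args) throws Exception {
        // A pooled thread that outlives the "job", analogous to a long-lived
        // JobManager / REST handler thread that is reused across submissions.
        ExecutorService sharedPool = Executors.newFixedThreadPool(1);

        // The "job" runs on the shared thread and leaves data behind in that
        // thread's ThreadLocal map.
        sharedPool.submit(() -> CACHE.set(new byte[1024 * 1024])).get();

        // The task is done, but the pooled thread still strongly references
        // the byte[] through its ThreadLocal map. If the value were an
        // instance of a class loaded by the per-job user-code classloader,
        // that classloader (and its Metaspace footprint) would stay reachable
        // until the thread dies or CACHE.remove() is called on that thread.
        sharedPool.shutdown();
    }
}

Whether this exact pattern is what you are hitting cannot be confirmed from
the information in the thread, but it is the kind of side effect that is
hard to rule out once arbitrary main() code runs inside the JobManager JVM.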