Hi Ilan,

we are aware of multiple issues when web-submission can result in
classloader / thread local leaks, which could potentially result in the
behavior you're describing. We're working on addressing them.

FLINK-25022 [1]: The most critical one leaking thread locals.
FLINK-25027 [2]: Is only a memory improvement for a particular situation (a
lot of small batch jobs) and could be fixed by accounting for when setting
Metaspace size.
FLINK-25023 [3]: Can leak the classloader of the first job submitted via
rest API. (constant overhead for Metaspace)

In general, web-submission is different from a normal submission in way,
that the "main method" of the uploaded jar is executed on JobManager and
it's really hard to isolate it's execution from possible side effects.

[1] https://issues.apache.org/jira/browse/FLINK-25022
[2] https://issues.apache.org/jira/browse/FLINK-25027
[3] https://issues.apache.org/jira/browse/FLINK-25023

Best,
D.

On Thu, Dec 2, 2021 at 3:51 PM Ilan Huchansky <ilan.huchan...@start.io>
wrote:

> *Hi Flink mailing list,*
>
>
>
> I am Ilan from Start.io data platform team, need some guidance.
>
>
>
> We have a flow with the following use case:
>
>
>
>    - We read files from AWS S3 buckets process them on our cluster and
>    sink the data into files using Flink file sink.
>    - The jobs use always the same jar, we uploaded it to every job
>    manager on the cluster.
>    - We are submitting jobs constantly through the REST API.
>    - Each job reads one or more files from S3.
>    - The jobs can run from 20 seconds up to 3.5 hours.
>    - The jobs run on batch mode
>    - Running flink 1.13.1
>    - We are running in cluster mode using docker, same machines are being
>    used for task and job manager.
>
>
>
>  We are struggling with the same error, over and over again. We encounter
> it in the job manager and in the task manager.
>
>
>
> After a while that the cluster is running and jobs are finishing correctly
> the task and job manager fail to operate due to:
>
> Caused by: java.lang.OutOfMemoryError: unable to create new native thread.
>
>
>
>
>
> We also see some sporadic failure of java.lang.NoClassDefFoundError, not
> sure it is related.
>
>
>
> Our set up and configuration are as follow:
>
> ·         5 nodes cluster running on docker
>
> ·         Relevant memory config:
>
> jobmanager.memory.heap.size: 1600m
>
> taskmanager.memory.process.size: 231664m
>
> taskmanager.memory.network.fraction: 0.3
>
> taskmanager.memory.jvm-metaspace.size: 10g
>
> jobmanager.memory.jvm-metaspace.size: 2g
>
> taskmanager.memory.framework.off-heap.size: 1g
>
>
>
> ·         Host details
>
> max locked memory  (kbytes, -l) 65536
>
> max memory size       (kbytes, -m) unlimited
>
> open files                     (-n) 1024
>
> max user processes    (-u) 1547269
>
> virtual memory           (kbytes, -v) unlimited
>
> file locks                       (-x) unlimited
>
>
>
> cat /proc/sys/kernel/threads-max: 3094538
>
> kernel.pid_max = 57344
>
>
>
>
>
> We try to increase the max user processes, also to increase and decrease
> the jvm-metaspace.
>
>
>
> Should we keep increasing the max number of processes on the host, Is
> there a way to limit the number of threads from flink config?
>
>
>
> What should we do? Any insights?
> I can provide more information as needed.
>
>
>
> Thanks in advance
>
>
>
>  Ilan
>
>
>

Reply via email to