Hi Ilan,

I think so, using CLI instead of REST API should solve this, as the user
code execution would be pulled out to a separate JVM. If you're going to
try that, it would be great to hear back whether it has solved your issue.

As for 1.13.4, there is currently no on-going effort / concrete plan on the
release.

Best,
D.

On Sun, Dec 5, 2021 at 4:06 PM Ilan Huchansky <ilan.huchan...@start.io>
wrote:

> Hi David,
>
>
>
> Thanks for your fast response.
>
>
>
> Do you think that changing the submission method could solve the problem?
> Using the CLI instead of the REST API.
>
>
>
> Another question, I see that the most critical issue (FLINK-25022) is in
> progress and should be released on with version 1.13.4 , do you know when
> this version is planned to be released?
>
>
>
> Thanks again,
>
> Ilan.
>
>
>
> *From: *David Morávek <d...@apache.org>
> *Date: *Thursday, 2 December 2021 at 17:25
> *To: *Ilan Huchansky <ilan.huchan...@start.io>
> *Cc: *user@flink.apache.org <user@flink.apache.org>, Start.io SDP <
> s...@start.io>
> *Subject: *Re: Unable to create new native thread error
>
> Hi Ilan,
>
>
>
> we are aware of multiple issues when web-submission can result in
> classloader / thread local leaks, which could potentially result in the
> behavior you're describing. We're working on addressing them.
>
>
>
> FLINK-25022 [1]: The most critical one leaking thread locals.
> FLINK-25027 [2]: Is only a memory improvement for a particular situation
> (a lot of small batch jobs) and could be fixed by accounting for when
> setting Metaspace size.
> FLINK-25023 [3]: Can leak the classloader of the first job submitted via
> rest API. (constant overhead for Metaspace)
>
>
>
> In general, web-submission is different from a normal submission in way,
> that the "main method" of the uploaded jar is executed on JobManager and
> it's really hard to isolate it's execution from possible side effects.
>
>
>
> [1] https://issues.apache.org/jira/browse/FLINK-25022
>
> [2] https://issues.apache.org/jira/browse/FLINK-25027
>
> [3] https://issues.apache.org/jira/browse/FLINK-25023
>
>
>
> Best,
>
> D.
>
>
>
> On Thu, Dec 2, 2021 at 3:51 PM Ilan Huchansky <ilan.huchan...@start.io>
> wrote:
>
> *Hi Flink mailing list,*
>
>
>
> I am Ilan from Start.io data platform team, need some guidance.
>
>
>
> We have a flow with the following use case:
>
>
>
>    - We read files from AWS S3 buckets process them on our cluster and
>    sink the data into files using Flink file sink.
>    - The jobs use always the same jar, we uploaded it to every job
>    manager on the cluster.
>    - We are submitting jobs constantly through the REST API.
>    - Each job reads one or more files from S3.
>    - The jobs can run from 20 seconds up to 3.5 hours.
>    - The jobs run on batch mode
>    - Running flink 1.13.1
>    - We are running in cluster mode using docker, same machines are being
>    used for task and job manager.
>
>
>
>  We are struggling with the same error, over and over again. We encounter
> it in the job manager and in the task manager.
>
>
>
> After a while that the cluster is running and jobs are finishing correctly
> the task and job manager fail to operate due to:
>
> Caused by: java.lang.OutOfMemoryError: unable to create new native thread.
>
>
>
>
>
> We also see some sporadic failure of java.lang.NoClassDefFoundError, not
> sure it is related.
>
>
>
> Our set up and configuration are as follow:
>
> ·         5 nodes cluster running on docker
>
> ·         Relevant memory config:
>
> jobmanager.memory.heap.size: 1600m
>
> taskmanager.memory.process.size: 231664m
>
> taskmanager.memory.network.fraction: 0.3
>
> taskmanager.memory.jvm-metaspace.size: 10g
>
> jobmanager.memory.jvm-metaspace.size: 2g
>
> taskmanager.memory.framework.off-heap.size: 1g
>
>
>
> ·         Host details
>
> max locked memory  (kbytes, -l) 65536
>
> max memory size       (kbytes, -m) unlimited
>
> open files                     (-n) 1024
>
> max user processes    (-u) 1547269
>
> virtual memory           (kbytes, -v) unlimited
>
> file locks                       (-x) unlimited
>
>
>
> cat /proc/sys/kernel/threads-max: 3094538
>
> kernel.pid_max = 57344
>
>
>
>
>
> We try to increase the max user processes, also to increase and decrease
> the jvm-metaspace.
>
>
>
> Should we keep increasing the max number of processes on the host, Is
> there a way to limit the number of threads from flink config?
>
>
>
> What should we do? Any insights?
> I can provide more information as needed.
>
>
>
> Thanks in advance
>
>
>
>  Ilan
>
>
>
>

Reply via email to