[jira] [Updated] (FLINK-37589) Job submission via REST API is not thread safe

Ilya Soin (Jira) Sat, 05 Apr 2025 10:40:10 -0700


     [ https://issues.apache.org/jira/browse/FLINK-37589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Ilya Soin updated FLINK-37589:
------------------------------
    Description: 
When the Flink Kubernetes Operator deploys more than one job in parallel, some 
jobs are sometimes deployed two, three, or more times. For example, if 5 jobs 
are deployed at the same time, the cluster can end up with jobs 1,1,3,4,5 or 
1,2,2,3,5 or even 1,2,2,2,5 instead of jobs 1,2,3,4,5. This happens 
consistently with Python jobs and rarely with other types of jobs. The easiest 
way to reproduce it is to deploy 2-3 Python jobs in parallel on a standalone 
Flink cluster.

The issue is definitely not in the Operator, as has been discussed in 
FLINK-32592 and FLINK-32552. I was able to fix it by introducing a synchronized 
lock in the 
[JarRunHandler|https://github.com/apache/flink/blob/release-1.20/flink-runtime-web/src/main/java/org/apache/flink/runtime/webmonitor/handlers/JarRunHandler.java#L108]
 like this:
{code:java}
() -> {
    // runLock serializes concurrent job submissions.
    synchronized (runLock) {
        return applicationRunner.run(gateway, program, effectiveConfiguration);
    }
},
{code}
I'm not sure this is the best solution and would appreciate pointers or 
discussion. I suspect that we see it mostly with Python jobs because they take 
longer to deploy, which leaves more time for submissions to overlap.
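
To illustrate the locking pattern outside of Flink, here is a minimal, 
self-contained sketch. ToyRunner, submitAll and the job names are hypothetical 
and are not Flink classes. The shared field in ToyRunner is only an assumed 
stand-in for whatever state is actually racing during submission, which has not 
been identified yet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SerializedSubmissionSketch {

    /** Hypothetical runner that is not safe to call from several threads at once. */
    static class ToyRunner {
        // Shared mutable state (an assumption for this sketch, not Flink's actual state).
        private String currentProgram;

        String run(String program) {
            currentProgram = program;
            try {
                Thread.sleep(20); // simulate a slow deployment
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            // Another thread may have overwritten the field while this one slept.
            return currentProgram;
        }
    }

    /** Submits jobs job-1..job-N in parallel and returns what each submission deployed. */
    static List<String> submitAll(int jobs, boolean useLock) {
        ToyRunner runner = new ToyRunner();
        Object runLock = new Object();
        ExecutorService pool = Executors.newFixedThreadPool(jobs);
        try {
            List<CompletableFuture<String>> futures = new ArrayList<>();
            for (int i = 1; i <= jobs; i++) {
                String program = "job-" + i;
                futures.add(CompletableFuture.supplyAsync(() -> {
                    if (useLock) {
                        // Same shape as the proposed fix: one run() at a time.
                        synchronized (runLock) {
                            return runner.run(program);
                        }
                    }
                    return runner.run(program);
                }, pool));
            }
            List<String> deployed = new ArrayList<>();
            for (CompletableFuture<String> future : futures) {
                deployed.add(future.join());
            }
            return deployed;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("with lock:    " + submitAll(5, true));
        System.out.println("without lock: " + submitAll(5, false));
    }
}
```

With the lock, every submission returns its own job name. Without it, the toy 
runner can report the same job more than once, which resembles the duplicates 
described above. The trade-off is that the lock makes submissions strictly 
sequential, so slow deployments queue up behind each other.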

  was:
When the Flink Kubernetes Operator deploys more than one job in parallel, some 
jobs are sometimes deployed two, three, or more times. For example, if 5 jobs 
are deployed at the same time, the cluster can end up with jobs 1,1,3,4,5 or 
1,2,2,3,5 or even 1,2,2,2,5 instead of jobs 1,2,3,4,5. This happens 
consistently with Python jobs and rarely with other types of jobs. The easiest 
way to reproduce it is to deploy 2-3 Python jobs in parallel on a standalone 
Flink cluster.

The issue is definitely not in the Operator, as has been discussed here and 
here. I was able to fix it by introducing a synchronized lock in the 
[JarRunHandler|https://github.com/apache/flink/blob/release-1.20/flink-runtime-web/src/main/java/org/apache/flink/runtime/webmonitor/handlers/JarRunHandler.java#L108]
 like this:
{code:java}
() -> {
    // runLock serializes concurrent job submissions.
    synchronized (runLock) {
        return applicationRunner.run(gateway, program, effectiveConfiguration);
    }
},
{code}
I'm not sure this is the best solution and would appreciate pointers or 
discussion. I suspect that we see it mostly with Python jobs because they take 
longer to deploy, which leaves more time for submissions to overlap.


> Job submission via REST API is not thread safe
> ----------------------------------------------
>
>                 Key: FLINK-37589
>                 URL: https://issues.apache.org/jira/browse/FLINK-37589
>             Project: Flink
>          Issue Type: Bug
>          Components: Client / Job Submission
>    Affects Versions: 1.20.0, 1.20.1
>            Reporter: Ilya Soin
>            Priority: Major
>
> When the Flink Kubernetes Operator deploys more than one job in parallel, 
> some jobs are sometimes deployed two, three, or more times. For example, if 
> 5 jobs are deployed at the same time, the cluster can end up with jobs 
> 1,1,3,4,5 or 1,2,2,3,5 or even 1,2,2,2,5 instead of jobs 1,2,3,4,5. This 
> happens consistently with Python jobs and rarely with other types of jobs. 
> The easiest way to reproduce it is to deploy 2-3 Python jobs in parallel on 
> a standalone Flink cluster.
> The issue is definitely not in the Operator, as has been discussed in 
> FLINK-32592 and FLINK-32552. I was able to fix it by introducing a 
> synchronized lock in the 
> [JarRunHandler|https://github.com/apache/flink/blob/release-1.20/flink-runtime-web/src/main/java/org/apache/flink/runtime/webmonitor/handlers/JarRunHandler.java#L108]
>  like this:
> {code:java}
> () -> {
>     // runLock serializes concurrent job submissions.
>     synchronized (runLock) {
>         return applicationRunner.run(gateway, program, effectiveConfiguration);
>     }
> },
> {code}
> I'm not sure this is the best solution and would appreciate pointers or 
> discussion. I suspect that we see it mostly with Python jobs because they 
> take longer to deploy, which leaves more time for submissions to overlap.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
