[ 
https://issues.apache.org/jira/browse/FLINK-32552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabio Wanner updated FLINK-32552:
---------------------------------
    Description: 
*Context*

As part of our end-to-end tests we regularly deploy all of our Flink session 
jobs to a staging environment. Some of the jobs are bundled together in one 
Helm chart and are therefore deployed at the same time. There are around 40 
individual Flink jobs, all running on the same Flink session cluster. Each e2e 
test run gets its own session cluster. The problems described below occur 
rarely (in roughly 1 of 50 runs).

*Problem*

Occasionally the operator seems to "mix up" the deployments. This shows up in 
the Flink cluster logs as multiple {{Received JobGraph submission '<JOB NAME>' 
(<JOB_ID>)}} entries that carry different job names but the same job_id. This 
results in errors such as:

{{DuplicateJobSubmissionException}} or {{ClassNotFoundException}}.

It is also visible in the FlinkSessionJob resource: status.jobStatus.jobName 
does not match the expected name of the job being deployed (the job name is 
passed to the application as an argument).
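
A check for this condition could be sketched as follows. This is a minimal sketch under assumptions: the FlinkSessionJob resources are already available as parsed JSON (for example from {{kubectl get flinksessionjob -o json}}), and the function name, the {{expected_names}} mapping and the sample values are hypothetical. Only the field path status.jobStatus.jobName is taken from the description above.

```python
def find_name_mismatches(resources, expected_names):
    """Return (resource name, expected job name, reported job name) per mismatch.

    resources: parsed FlinkSessionJob objects (dicts).
    expected_names: hypothetical mapping of metadata.name -> expected job name.
    """
    mismatches = []
    for res in resources:
        name = res["metadata"]["name"]
        # status or jobStatus may be missing or null before the first submission.
        job_status = (res.get("status") or {}).get("jobStatus") or {}
        reported = job_status.get("jobName")
        expected = expected_names.get(name)
        # Skip resources without an expectation or without a reported job name.
        if expected is None or reported is None:
            continue
        if reported != expected:
            mismatches.append((name, expected, reported))
    return mismatches


# Made-up sample data, only to show the shape of the input.
sample_resources = [
    {"metadata": {"name": "job-a"}, "status": {"jobStatus": {"jobName": "job_a"}}},
    {"metadata": {"name": "job-b"}, "status": {"jobStatus": {"jobName": "job_a"}}},
    {"metadata": {"name": "job-c"}, "status": {}},
]
sample_expected = {"job-a": "job_a", "job-b": "job_b", "job-c": "job_c"}

for mismatch in find_name_mismatches(sample_resources, sample_expected):
    print(mismatch)
```

Resources that have not reported a job name yet are skipped rather than flagged, so the check only fires once a job was actually submitted under the wrong name.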

So far we have been unable to reproduce the error reliably.

*Details*

The following lines show the status of 3 jobs from the viewpoint of the Flink 
cluster dashboard and of the FlinkSessionJob resource:

 

*aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615*

Apache Flink Dashboard:
 * State: Restarting
 * ID: a7d36f3881f943a00000000000000002
 * Exceptions: Cannot load user class: aelps.pipelines.aletsch.smc.SMCUrlMapper

FlinkSessionJob Resource:
 * State: RUNNING
 * jobId: a1221c743367497b0000000000000002
 * uid: a1221c74-3367-497b-ad2f-8793ab23919d

 

*aletsch_mat_e5730831db8092adb12f5189c4c895ef3a268615*

Apache Flink Dashboard:
 * State: -
 * ID: -

FlinkSessionJob Resource:
 * State: UPGRADING
 * jobId: -
 * uid: a7d36f38-81f9-43a0-898f-19b950430e9d

Flink K8s Operator:
 * Exceptions: DuplicateJobSubmissionException: Job has already been submitted.

 

*aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615*

Apache Flink Dashboard:
 * State: Running
 * ID: e692c2dfaa18441c0000000000000002
 * Exceptions: -

FlinkSessionJob Resource:
 * State: RUNNING
 * jobId: e692c2dfaa18441c0000000000000002
 * uid: e692c2df-aa18-441c-a352-88aefa9a3017

As we can see, the *aletsch_smc* job is reported as running by its 
FlinkSessionJob resource, but in the cluster it is crash-looping under a jobID 
that matches the uid of the *aletsch_mat* resource, while *aletsch_mat* itself 
is not running at all. The following logs also show suspicious entries: there 
are several {{Received JobGraph submission}} lines from different jobs with 
the same jobID.
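
The relationship between the IDs can be checked directly from the values listed above. In all three cases the first 16 hex digits of a jobID equal the first 16 hex digits of a resource uid with the dashes removed. The sketch below uses that observation to map a cluster-side jobID back to the resource whose uid it matches. That the operator derives jobIDs from uids in this way is an inference from the data in this report, not something confirmed here, and the helper name is hypothetical.

```python
# uid of each FlinkSessionJob resource, copied from the listing above.
RESOURCE_UIDS = {
    "aletsch_smc": "a1221c74-3367-497b-ad2f-8793ab23919d",
    "aletsch_mat": "a7d36f38-81f9-43a0-898f-19b950430e9d",
    "aletsch_wp_wafer": "e692c2df-aa18-441c-a352-88aefa9a3017",
}


def resource_for_job_id(job_id, uids=RESOURCE_UIDS):
    """Return the resource whose uid shares its first 16 hex digits with job_id."""
    prefix = job_id[:16]
    for resource, uid in uids.items():
        if uid.replace("-", "")[:16] == prefix:
            return resource
    return None


# jobID under which the dashboard shows aletsch_smc crash-looping.
print(resource_for_job_id("a7d36f3881f943a00000000000000002"))
# jobID recorded in the aletsch_smc FlinkSessionJob resource.
print(resource_for_job_id("a1221c743367497b0000000000000002"))
```

With the values from this report the first call returns aletsch_mat and the second returns aletsch_smc, which is the mismatch described above: the job named aletsch_smc is running under the ID that belongs to the aletsch_mat resource.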

 

*Logs*

The logs are filtered by the three jobIds from above. Within each block the 
entries are listed newest first.

 

JobID: a7d36f3881f943a00000000000000002
{code:bash}
Flink Cluster
    ...
    2023-07-06 10:23:50,552 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
(a7d36f3881f943a00000000000000002) switched from state RUNNING to RESTARTING.
    2023-07-06 10:23:50     file: 
'/tmp/tm_10.0.11.159:6122-e9fadc/blobStorage/job_a7d36f3881f943a00000000000000002/blob_p-40c7a30adef8868254191d2cf2dbc4cb7ab46f0d-8a02a0583d91c5e8e6c94f378aa444c2'
 (valid JAR)
    2023-07-06 10:23:50,522 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=4}]
    2023-07-06 10:23:50,522 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=3}]
    2023-07-06 10:23:50,522 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=2}]
    2023-07-06 10:23:50,522 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=1}]
    2023-07-06 10:23:50,512 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
(a7d36f3881f943a00000000000000002) switched from state RESTARTING to RUNNING.
    2023-07-06 10:23:48,979 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Clearing resource requirements of job a7d36f3881f943a00000000000000002
    2023-07-06 10:23:48,853 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=1}]
    2023-07-06 10:23:48,853 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=2}]
    2023-07-06 10:23:48,853 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=3}]
    2023-07-06 10:23:48     file: 
'/tmp/tm_10.0.11.159:6122-e9fadc/blobStorage/job_a7d36f3881f943a00000000000000002/blob_p-40c7a30adef8868254191d2cf2dbc4cb7ab46f0d-8a02a0583d91c5e8e6c94f378aa444c2'
 (valid JAR)
    2023-07-06 10:23:48,661 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
(a7d36f3881f943a00000000000000002) switched from state RUNNING to RESTARTING.
    2023-07-06 10:23:48,583 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=4}]
    2023-07-06 10:23:48,583 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=3}]
    2023-07-06 10:23:48,583 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=2}]
    2023-07-06 10:23:48,582 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=1}]
    2023-07-06 10:23:48,573 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
(a7d36f3881f943a00000000000000002) switched from state RESTARTING to RUNNING.
    2023-07-06 10:23:47,562 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Received 
JobGraph submission 'aletsch_mat_e5730831db8092adb12f5189c4c895ef3a268615' 
(a7d36f3881f943a00000000000000002).
    2023-07-06 10:23:47,518 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Clearing resource requirements of job a7d36f3881f943a00000000000000002
    2023-07-06 10:23:47,517 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=1}]
    2023-07-06 10:23:47,517 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=2}]
    2023-07-06 10:23:47,516 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=3}]
    2023-07-06 10:23:47,463 INFO  
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - 
Submitting Job with JobId=a7d36f3881f943a00000000000000002.
    2023-07-06 10:23:47,463 INFO  
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - 
Job a7d36f3881f943a00000000000000002 is submitted.
    2023-07-06 10:23:47,104 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
(a7d36f3881f943a00000000000000002) switched from state RUNNING to RESTARTING.
    2023-07-06 10:23:46,804 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Offer 
reserved slots to the leader of job a7d36f3881f943a00000000000000002.
    2023-07-06 10:23:46,804 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Establish 
JobManager connection for job a7d36f3881f943a00000000000000002.
    2023-07-06 10:23:46,799 INFO  
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Successful 
registration at job manager 
akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_2 for job 
a7d36f3881f943a00000000000000002.
    2023-07-06 10:23:46,577 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive slot 
request 221b24b50413805c9e35d7620b8a00b8 for job 
a7d36f3881f943a00000000000000002 from resource manager with leader id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:46,577 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive slot 
request 49d3c8cd1080bd38c0144c3d3cc597cd for job 
a7d36f3881f943a00000000000000002 from resource manager with leader id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:46,577 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive slot 
request 819f34cc8957066478fb4b3549367d24 for job 
a7d36f3881f943a00000000000000002 from resource manager with leader id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:46,574 INFO  
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Add job 
a7d36f3881f943a00000000000000002 for job leader monitoring.
    2023-07-06 10:23:46,570 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive slot 
request 36802a7de1487f3fb1b6a3b509bd5e20 for job 
a7d36f3881f943a00000000000000002 from resource manager with leader id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:46,560 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=4}]
    2023-07-06 10:23:46,556 INFO  
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - 
Registered job manager 
aaa9331f70b07a195b5f09d57d1b4...@akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_2
 for job a7d36f3881f943a00000000000000002.
    2023-07-06 10:23:46,528 INFO  
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - 
Registering job manager 
aaa9331f70b07a195b5f09d57d1b4...@akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_2
 for job a7d36f3881f943a00000000000000002.
    2023-07-06 10:23:46,480 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
(a7d36f3881f943a00000000000000002) switched from state CREATED to RUNNING.
    2023-07-06 10:23:46,476 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Starting execution of job 
'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' 
(a7d36f3881f943a00000000000000002) under job master id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:46,466 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Using failover strategy 
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@62877000
 for aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
(a7d36f3881f943a00000000000000002).
    2023-07-06 10:23:46,079 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Running initialization on master for job 
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
(a7d36f3881f943a00000000000000002).
    2023-07-06 10:23:46,059 INFO  
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - 
Found 0 checkpoints in 
KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-a7d36f3881f943a00000000000000002-config-map'}.
    2023-07-06 10:23:46,051 INFO  
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - 
Recovering checkpoints from 
KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-a7d36f3881f943a00000000000000002-config-map'}.
    2023-07-06 10:23:46,006 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Using restart back off time strategy 
ExponentialDelayRestartBackoffTimeStrategy(initialBackoffMS=1000, 
maxBackoffMS=300000, backoffMultiplier=2.0, resetBackoffThresholdMS=3600000, 
jitterFactor=0.5, currentBackoffMS=1000, lastFailureTimestamp=0) for 
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
(a7d36f3881f943a00000000000000002).
    2023-07-06 10:23:45,987 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Initializing job 
'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' 
(a7d36f3881f943a00000000000000002).
    2023-07-06 10:23:45,966 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Received 
JobGraph submission 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615' 
(a7d36f3881f943a00000000000000002).
    2023-07-06 10:23:45,965 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Received 
JobGraph submission 'aletsch_mat_e5730831db8092adb12f5189c4c895ef3a268615' 
(a7d36f3881f943a00000000000000002).
    2023-07-06 10:23:45,915 INFO  
org.apache.flink.runtime.jobmanager.DefaultJobGraphStore     [] - Added 
JobGraph(jobId: a7d36f3881f943a00000000000000002) to 
KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-cluster-config-map'}.
    2023-07-06 10:23:45,859 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Submitting 
job 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' 
(a7d36f3881f943a00000000000000002).
    2023-07-06 10:23:45,857 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Received 
JobGraph submission 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' 
(a7d36f3881f943a00000000000000002).
    2023-07-06 10:23:45,705 INFO  
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - 
Submitting Job with JobId=a7d36f3881f943a00000000000000002.
    2023-07-06 10:23:45,705 INFO  
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - 
Job a7d36f3881f943a00000000000000002 is submitted.
    2023-07-06 10:23:45,705 INFO  
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - 
Submitting Job with JobId=a7d36f3881f943a00000000000000002.
    2023-07-06 10:23:45,705 INFO  
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - 
Job a7d36f3881f943a00000000000000002 is submitted.
    2023-07-06 10:23:45,705 INFO  
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - 
Submitting Job with JobId=a7d36f3881f943a00000000000000002.
    2023-07-06 10:23:45,705 INFO  
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - 
Job a7d36f3881f943a00000000000000002 is submitted.

    Flink Operator
    2023-07-06 10:26:25,792 o.a.f.k.o.s.AbstractFlinkService [INFO 
][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: 
a7d36f3881f943a00000000000000002 to session cluster.
    2023-07-06 10:25:05,163 o.a.f.k.o.s.AbstractFlinkService [INFO 
][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: 
a7d36f3881f943a00000000000000002 to session cluster.
    2023-07-06 10:24:24,553 o.a.f.k.o.s.AbstractFlinkService [INFO 
][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: 
a7d36f3881f943a00000000000000002 to session cluster.
    2023-07-06 10:24:03,850 o.a.f.k.o.s.AbstractFlinkService [INFO 
][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: 
a7d36f3881f943a00000000000000002 to session cluster.
    2023-07-06 10:23:53,094 o.a.f.k.o.s.AbstractFlinkService [INFO 
][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: 
a7d36f3881f943a00000000000000002 to session cluster.
    2023-07-06 10:23:47,346 o.a.f.k.o.s.AbstractFlinkService [INFO 
][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: 
a7d36f3881f943a00000000000000002 to session cluster.
    2023-07-06 10:23:45,372 o.a.f.k.o.s.AbstractFlinkService [INFO 
][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: 
a7d36f3881f943a00000000000000002 to session cluster.
{code}
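
The conflicting submissions in the block above can be found mechanically by grouping the dispatcher's {{Received JobGraph submission}} lines by job ID. The following is a minimal sketch with hypothetical function names. The pattern allows arbitrary whitespace between tokens so that it should also match the hard-wrapped lines of the excerpt, and the sample consists of dispatcher entries copied from this report and rejoined into single lines.

```python
import re
from collections import defaultdict

# Matches: Received JobGraph submission '<JOB NAME>' (<JOB_ID>)
SUBMISSION = re.compile(
    r"Received\s+JobGraph\s+submission\s+'([^']+)'\s+\(([0-9a-f]{32})\)"
)


def names_per_job_id(log_text):
    """Map each job ID to the set of job names submitted under it."""
    names = defaultdict(set)
    for name, job_id in SUBMISSION.findall(log_text):
        names[job_id].add(name)
    return names


def conflicting_job_ids(log_text):
    """Return only the job IDs that were submitted under more than one name."""
    return {
        job_id: sorted(names)
        for job_id, names in names_per_job_id(log_text).items()
        if len(names) > 1
    }


SAMPLE = """\
2023-07-06 10:23:47,702 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Received JobGraph submission 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615' (e692c2dfaa18441c0000000000000002).
2023-07-06 10:23:45,966 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Received JobGraph submission 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615' (a7d36f3881f943a00000000000000002).
2023-07-06 10:23:45,965 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Received JobGraph submission 'aletsch_mat_e5730831db8092adb12f5189c4c895ef3a268615' (a7d36f3881f943a00000000000000002).
2023-07-06 10:23:45,857 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Received JobGraph submission 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' (a7d36f3881f943a00000000000000002).
"""

for job_id, job_names in conflicting_job_ids(SAMPLE).items():
    print(job_id, job_names)
```

On this sample only a7d36f3881f943a00000000000000002 is reported, with all three job names attached to it. The e692c2dfaa18441c0000000000000002 submission is not flagged because only one name was seen under that ID.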
 
JobID: a1221c743367497b0000000000000002
{code:bash}
Flink Cluster
    2023-07-06 11:23:48,062 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed 
checkpoint 1 for job a1221c743367497b0000000000000002 (48548 bytes, 
checkpointDuration=107 ms, finalizationTime=33 ms).
    2023-07-06 11:23:47,937 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering 
checkpoint 1 (type=CheckpointType{name='Checkpoint', 
sharingFilesStrategy=FORWARD_BACKWARD}) @ 1688635427922 for job 
a1221c743367497b0000000000000002.
    2023-07-06 10:23:48,567 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Offer 
reserved slots to the leader of job a1221c743367497b0000000000000002.
    2023-07-06 10:23:48,567 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Establish 
JobManager connection for job a1221c743367497b0000000000000002.
    2023-07-06 10:23:48,567 INFO  
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Successful 
registration at job manager 
akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_7 for job 
a1221c743367497b0000000000000002.
    2023-07-06 10:23:48,009 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive slot 
request cae6932e2409d5fece3f6b4636e3c71a for job 
a1221c743367497b0000000000000002 from resource manager with leader id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:48,003 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive slot 
request 8a57f3ecff07d300aebb33f6b3545aed for job 
a1221c743367497b0000000000000002 from resource manager with leader id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:48,003 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive slot 
request 7a4a0cfd16eec4a1cb043cce5f989db0 for job 
a1221c743367497b0000000000000002 from resource manager with leader id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:48,002 INFO  
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Add job 
a1221c743367497b0000000000000002 for job leader monitoring.
    2023-07-06 10:23:48,002 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive slot 
request 92cbc64513fa703e4acf28bbb3088a58 for job 
a1221c743367497b0000000000000002 from resource manager with leader id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:48,999 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a1221c743367497b0000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=4}]
    2023-07-06 10:23:47,998 INFO  
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - 
Registered job manager 
aaa9331f70b07a195b5f09d57d1b4...@akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_7
 for job a1221c743367497b0000000000000002.
    2023-07-06 10:23:47,953 INFO  
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - 
Registering job manager 
aaa9331f70b07a195b5f09d57d1b4...@akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_7
 for job a1221c743367497b0000000000000002.
    2023-07-06 10:23:47,922 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
(a1221c743367497b0000000000000002) switched from state CREATED to RUNNING.
    2023-07-06 10:23:47,887 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Starting execution of job 
'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' 
(a1221c743367497b0000000000000002) under job master id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:47,887 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Using failover strategy 
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@2222ba4d
 for aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
(a1221c743367497b0000000000000002).
    2023-07-06 10:23:47,880 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Running initialization on master for job 
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
(a1221c743367497b0000000000000002).
    2023-07-06 10:23:47,872 INFO  
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - 
Found 0 checkpoints in 
KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-a1221c743367497b0000000000000002-config-map'}.
    2023-07-06 10:23:47,867 INFO  
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - 
Recovering checkpoints from 
KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-a1221c743367497b0000000000000002-config-map'}.
    2023-07-06 10:23:47,832 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Using restart back off time strategy 
ExponentialDelayRestartBackoffTimeStrategy(initialBackoffMS=1000, 
maxBackoffMS=300000, backoffMultiplier=2.0, resetBackoffThresholdMS=3600000, 
jitterFactor=0.5, currentBackoffMS=1000, lastFailureTimestamp=0) for 
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
(a1221c743367497b0000000000000002).
    2023-07-06 10:23:47,832 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Initializing job 
'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' 
(a1221c743367497b0000000000000002).
    2023-07-06 10:23:47,820 INFO  
org.apache.flink.runtime.jobmanager.DefaultJobGraphStore     [] - Added 
JobGraph(jobId: a1221c743367497b0000000000000002) to 
KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-cluster-config-map'}.
    2023-07-06 10:23:47,780 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Submitting 
job 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' 
(a1221c743367497b0000000000000002).
    2023-07-06 10:23:47,776 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Received 
JobGraph submission 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' 
(a1221c743367497b0000000000000002).
    2023-07-06 10:23:47,668 INFO  
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - 
Submitting Job with JobId=a1221c743367497b0000000000000002.
    2023-07-06 10:23:47,668 INFO  
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - 
Job a1221c743367497b0000000000000002 is submitted.

    Flink Operator
    2023-07-06 10:23:48,007 o.a.f.k.o.s.AbstractFlinkService [INFO 
][aelps-staging/aletsch-smc-staging-e5730831] Submitted job: 
a1221c743367497b0000000000000002 to session cluster.
    2023-07-06 10:23:47,505 o.a.f.k.o.s.AbstractFlinkService [INFO 
][aelps-staging/aletsch-smc-staging-e5730831] Submitting job: 
a1221c743367497b0000000000000002 to session cluster.
    2023-07-06 10:23:45,416 o.a.f.k.o.s.AbstractFlinkService [INFO 
][aelps-staging/aletsch-smc-staging-e5730831] Submitting job: 
a1221c743367497b0000000000000002 to session cluster.
{code}
JobID: e692c2dfaa18441c0000000000000002
{code:bash}
Flink Cluster
    2023-07-06 11:23:48,004 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed 
checkpoint 1 for job e692c2dfaa18441c0000000000000002 (8194 bytes, 
checkpointDuration=125 ms, finalizationTime=28 ms).
    2023-07-06 11:23:47,867 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering 
checkpoint 1 (type=CheckpointType{name='Checkpoint', 
sharingFilesStrategy=FORWARD_BACKWARD}) @ 1688635427851 for job 
e692c2dfaa18441c0000000000000002.
    2023-07-06 10:23:48,568 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Offer 
reserved slots to the leader of job e692c2dfaa18441c0000000000000002.
    2023-07-06 10:23:48,568 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Establish 
JobManager connection for job e692c2dfaa18441c0000000000000002.
    2023-07-06 10:23:48,568 INFO  
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Successful 
registration at job manager 
akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_6 for job 
e692c2dfaa18441c0000000000000002.
    2023-07-06 10:23:48,002 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive slot 
request 5e5a0e55fac280bf31abf29a20bce684 for job 
e692c2dfaa18441c0000000000000002 from resource manager with leader id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:48,002 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive slot 
request 1cdbce54f4376a1df86430f97dab6858 for job 
e692c2dfaa18441c0000000000000002 from resource manager with leader id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:48,002 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive slot 
request 352db7288d0e4d1775d5f52dd14c769d for job 
e692c2dfaa18441c0000000000000002 from resource manager with leader id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:48,001 INFO  
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Add job 
e692c2dfaa18441c0000000000000002 for job leader monitoring.
    2023-07-06 10:23:48,000 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive slot 
request bffed3e4a4c8573049a4119bd7e15f19 for job 
e692c2dfaa18441c0000000000000002 from resource manager with leader id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:48,998 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job e692c2dfaa18441c0000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=4}]
    2023-07-06 10:23:47,998 INFO  
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - 
Registered job manager 
aaa9331f70b07a195b5f09d57d1b4...@akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_6
 for job e692c2dfaa18441c0000000000000002.
    2023-07-06 10:23:47,953 INFO  
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - 
Registering job manager 
aaa9331f70b07a195b5f09d57d1b4...@akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_6
 for job e692c2dfaa18441c0000000000000002.
    2023-07-06 10:23:47,851 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615 
(e692c2dfaa18441c0000000000000002) switched from state CREATED to RUNNING.
    2023-07-06 10:23:47,845 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Starting execution of job 
'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615' 
(e692c2dfaa18441c0000000000000002) under job master id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:47,844 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Using failover strategy 
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@7eeab246
 for aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615 
(e692c2dfaa18441c0000000000000002).
    2023-07-06 10:23:47,834 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Running initialization on master for job 
aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615 
(e692c2dfaa18441c0000000000000002).
    2023-07-06 10:23:47,825 INFO  
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - 
Found 0 checkpoints in 
KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-e692c2dfaa18441c0000000000000002-config-map'}.
    2023-07-06 10:23:47,813 INFO  
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - 
Recovering checkpoints from 
KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-e692c2dfaa18441c0000000000000002-config-map'}.
    2023-07-06 10:23:47,782 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Using restart back off time strategy 
ExponentialDelayRestartBackoffTimeStrategy(initialBackoffMS=1000, 
maxBackoffMS=300000, backoffMultiplier=2.0, resetBackoffThresholdMS=3600000, 
jitterFactor=0.5, currentBackoffMS=1000, lastFailureTimestamp=0) for 
aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615 
(e692c2dfaa18441c0000000000000002).
    2023-07-06 10:23:47,781 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Initializing job 
'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615' 
(e692c2dfaa18441c0000000000000002).
    2023-07-06 10:23:47,774 INFO  
org.apache.flink.runtime.jobmanager.DefaultJobGraphStore     [] - Added 
JobGraph(jobId: e692c2dfaa18441c0000000000000002) to 
KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-cluster-config-map'}.
    2023-07-06 10:23:47,703 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Submitting 
job 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615' 
(e692c2dfaa18441c0000000000000002).
    2023-07-06 10:23:47,702 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Received 
JobGraph submission 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615' 
(e692c2dfaa18441c0000000000000002).
    2023-07-06 10:23:47,650 INFO  
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - 
Submitting Job with JobId=e692c2dfaa18441c0000000000000002.
    2023-07-06 10:23:47,650 INFO  
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - 
Job e692c2dfaa18441c0000000000000002 is submitted.

    Flink Operator
    2023-07-06 10:23:47,973 o.a.f.k.o.s.AbstractFlinkService [INFO 
][aelps-staging/aletsch-wp-wafer-staging-e5730831] Submitted job: 
e692c2dfaa18441c0000000000000002 to session cluster.
    2023-07-06 10:23:47,505 o.a.f.k.o.s.AbstractFlinkService [INFO 
][aelps-staging/aletsch-wp-wafer-staging-e5730831] Submitting job: 
e692c2dfaa18441c0000000000000002 to session cluster.
    2023-07-06 10:23:45,374 o.a.f.k.o.s.AbstractFlinkService [INFO 
][aelps-staging/aletsch-wp-wafer-staging-e5730831] Submitting job: 
e692c2dfaa18441c0000000000000002 to session cluster.
{code}
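
One way to surface the mix-up in a cluster log is to group the {{Received JobGraph submission}} entries by job ID and flag every ID that was submitted under more than one job name. The following is a minimal Python sketch of that idea, not part of the operator or of Flink. The function name is made up for illustration, and the sample text is assembled from entries quoted in this report.

```python
import re
from collections import defaultdict

# Dispatcher entry: Received JobGraph submission '<JOB NAME>' (<JOB_ID>).
# \s+ also matches line breaks, so mail-wrapped log entries still match.
SUBMISSION = re.compile(
    r"Received\s+JobGraph\s+submission\s+'([^']+)'\s+\(([0-9a-f]{32})\)"
)


def find_mixed_up_job_ids(log_text):
    """Return {job_id: sorted job names} for IDs seen with more than one name."""
    names_by_id = defaultdict(set)
    for name, job_id in SUBMISSION.findall(log_text):
        names_by_id[job_id].add(name)
    return {
        job_id: sorted(names)
        for job_id, names in names_by_id.items()
        if len(names) > 1
    }


# Sample assembled from log entries in this report (timestamps omitted).
SUFFIX = "e5730831db8092adb12f5189c4c895ef3a268615"
sample = f"""
StandaloneDispatcher [] - Received
JobGraph submission 'aletsch_smc_{SUFFIX}'
(a7d36f3881f943a00000000000000002).
StandaloneDispatcher [] - Received
JobGraph submission 'aletsch_mat_{SUFFIX}'
(a7d36f3881f943a00000000000000002).
StandaloneDispatcher [] - Received
JobGraph submission 'aletsch_wp_wafer_{SUFFIX}'
(a7d36f3881f943a00000000000000002).
StandaloneDispatcher [] - Received
JobGraph submission 'aletsch_wp_wafer_{SUFFIX}'
(e692c2dfaa18441c0000000000000002).
"""

mixed = find_mixed_up_job_ids(sample)
for job_id, names in mixed.items():
    print(job_id, names)
```

On this sample only a7d36f3881f943a00000000000000002 is flagged, because it was submitted under three different job names.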

  was:
*Context*

As part of our end-to-end tests, we regularly deploy all of our Flink session 
jobs to a staging environment. Some of the jobs are bundled together in one 
Helm chart and are therefore deployed at the same time. There are around 40 
individual Flink jobs (running on the same Flink session cluster). Each e2e 
test run gets its own session cluster. The problems described below occur 
rarely (in roughly 1 of 50 runs).

*Problem*

On rare occasions the operator seems to "mix up" the deployments. This shows up 
in the Flink cluster logs as multiple {{Received JobGraph submission '<JOB NAME>' 
(<JOB_ID>)}} entries from different jobs with the same job ID. This results in 
errors such as:

{{DuplicateJobSubmissionException}} or {{ClassNotFoundException}}.

It is also visible in the FlinkSessionJob resource: status.jobStatus.jobName 
does not match the expected name of the job being deployed (the job name is 
passed to the application as an argument).

So far we have been unable to reproduce the error reliably.

*Details*

The following lines show the status of 3 jobs from the viewpoint of the Flink 
cluster dashboard and of the FlinkSessionJob resource:

 

*aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615*

Apache Flink Dashboard:
 * State: Restarting
 * ID: a7d36f3881f943a00000000000000002
 * Exceptions: Cannot load user class: aelps.pipelines.aletsch.smc.SMCUrlMapper

FlinkSessionJob Resource:
 * State: RUNNING
 * jobId: a1221c743367497b0000000000000002
 * uid: a1221c74-3367-497b-ad2f-8793ab23919d

 

*aletsch_mat_e5730831db8092adb12f5189c4c895ef3a268615*

Apache Flink Dashboard:
 * State: -
 * ID: -

FlinkSessionJob Resource:
 * State: UPGRADING
 * jobId: -
 * uid: a7d36f38-81f9-43a0-898f-19b950430e9d

Flink K8s Operator:
 * Exceptions: DuplicateJobSubmissionException: Job has already been submitted.

 

*aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615*

Apache Flink Dashboard:
 * State: Running
 * ID: e692c2dfaa18441c0000000000000002
 * Exceptions: -

FlinkSessionJob Resource:
 * State: RUNNING
 * jobId: e692c2dfaa18441c0000000000000002
 * uid: e692c2df-aa18-441c-a352-88aefa9a3017

As we can see, the *aletsch_smc* job is running according to the 
FlinkSessionJob resource, but in the cluster it is crash-looping under a jobID 
that matches the uid of the {*}aletsch_mat{*} resource, while *aletsch_mat* 
itself is not running at all. The following logs also show some suspicious 
entries: there are several {{Received JobGraph submission}} entries from 
different jobs with the same jobID.
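
The match between a jobId and a resource uid can be checked mechanically. In the three listings above, each jobId starts with the first 16 hex digits of the corresponding uid (dashes removed), followed by a 16-digit suffix. The Python sketch below applies that prefix rule. The rule is inferred from these three examples only, not taken from the operator's source, and the helper name is made up for illustration.

```python
def job_id_matches_uid(job_id: str, uid: str) -> bool:
    """True if job_id starts with the first 16 hex digits of uid (dashes removed)."""
    return job_id[:16] == uid.replace("-", "")[:16]


# uids copied from the FlinkSessionJob listings above.
resources = {
    "aletsch_smc": "a1221c74-3367-497b-ad2f-8793ab23919d",
    "aletsch_mat": "a7d36f38-81f9-43a0-898f-19b950430e9d",
    "aletsch_wp_wafer": "e692c2df-aa18-441c-a352-88aefa9a3017",
}

# ID under which the dashboard shows the crash-looping aletsch_smc job.
dashboard_job_id = "a7d36f3881f943a00000000000000002"

owners = [
    name
    for name, uid in resources.items()
    if job_id_matches_uid(dashboard_job_id, uid)
]
print(owners)  # prints ['aletsch_mat']
```

The ID the dashboard shows for aletsch_smc therefore belongs to the aletsch_mat resource, which is consistent with the observation above.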

 

*Logs*

The logs are filtered by the 3 jobIds from above.

 

JobID: a7d36f3881f943a00000000000000002
{code:bash}
Flink Cluster
    ...
    2023-07-06 10:23:50,552 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
(a7d36f3881f943a00000000000000002) switched from state RUNNING to RESTARTING.
    2023-07-06 10:23:50     file: 
'/tmp/tm_10.0.11.159:6122-e9fadc/blobStorage/job_a7d36f3881f943a00000000000000002/blob_p-40c7a30adef8868254191d2cf2dbc4cb7ab46f0d-8a02a0583d91c5e8e6c94f378aa444c2'
 (valid JAR)
    2023-07-06 10:23:50,522 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=4}]
    2023-07-06 10:23:50,522 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=3}]
    2023-07-06 10:23:50,522 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=2}]
    2023-07-06 10:23:50,522 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=1}]
    2023-07-06 10:23:50,512 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
(a7d36f3881f943a00000000000000002) switched from state RESTARTING to RUNNING.
    2023-07-06 10:23:48,979 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Clearing resource requirements of job a7d36f3881f943a00000000000000002
    2023-07-06 10:23:48,853 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=1}]
    2023-07-06 10:23:48,853 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=2}]
    2023-07-06 10:23:48,853 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=3}]
    2023-07-06 10:23:48     file: 
'/tmp/tm_10.0.11.159:6122-e9fadc/blobStorage/job_a7d36f3881f943a00000000000000002/blob_p-40c7a30adef8868254191d2cf2dbc4cb7ab46f0d-8a02a0583d91c5e8e6c94f378aa444c2'
 (valid JAR)
    2023-07-06 10:23:48,661 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
(a7d36f3881f943a00000000000000002) switched from state RUNNING to RESTARTING.
    2023-07-06 10:23:48,583 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=4}]
    2023-07-06 10:23:48,583 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=3}]
    2023-07-06 10:23:48,583 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=2}]
    2023-07-06 10:23:48,582 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=1}]
    2023-07-06 10:23:48,573 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
(a7d36f3881f943a00000000000000002) switched from state RESTARTING to RUNNING.
    2023-07-06 10:23:47,562 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Received 
JobGraph submission 'aletsch_mat_e5730831db8092adb12f5189c4c895ef3a268615' 
(a7d36f3881f943a00000000000000002).
    2023-07-06 10:23:47,518 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Clearing resource requirements of job a7d36f3881f943a00000000000000002
    2023-07-06 10:23:47,517 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=1}]
    2023-07-06 10:23:47,517 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=2}]
    2023-07-06 10:23:47,516 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=3}]
    2023-07-06 10:23:47,463 INFO  
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - 
Submitting Job with JobId=a7d36f3881f943a00000000000000002.
    2023-07-06 10:23:47,463 INFO  
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - 
Job a7d36f3881f943a00000000000000002 is submitted.
    2023-07-06 10:23:47,104 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
(a7d36f3881f943a00000000000000002) switched from state RUNNING to RESTARTING.
    2023-07-06 10:23:46,804 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Offer 
reserved slots to the leader of job a7d36f3881f943a00000000000000002.
    2023-07-06 10:23:46,804 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Establish 
JobManager connection for job a7d36f3881f943a00000000000000002.
    2023-07-06 10:23:46,799 INFO  
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Successful 
registration at job manager 
akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_2 for job 
a7d36f3881f943a00000000000000002.
    2023-07-06 10:23:46,577 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive slot 
request 221b24b50413805c9e35d7620b8a00b8 for job 
a7d36f3881f943a00000000000000002 from resource manager with leader id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:46,577 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive slot 
request 49d3c8cd1080bd38c0144c3d3cc597cd for job 
a7d36f3881f943a00000000000000002 from resource manager with leader id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:46,577 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive slot 
request 819f34cc8957066478fb4b3549367d24 for job 
a7d36f3881f943a00000000000000002 from resource manager with leader id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:46,574 INFO  
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Add job 
a7d36f3881f943a00000000000000002 for job leader monitoring.
    2023-07-06 10:23:46,570 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive slot 
request 36802a7de1487f3fb1b6a3b509bd5e20 for job 
a7d36f3881f943a00000000000000002 from resource manager with leader id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:46,560 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a7d36f3881f943a00000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=4}]
    2023-07-06 10:23:46,556 INFO  
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - 
Registered job manager 
aaa9331f70b07a195b5f09d57d1b4...@akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_2
 for job a7d36f3881f943a00000000000000002.
    2023-07-06 10:23:46,528 INFO  
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - 
Registering job manager 
aaa9331f70b07a195b5f09d57d1b4...@akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_2
 for job a7d36f3881f943a00000000000000002.
    2023-07-06 10:23:46,480 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
(a7d36f3881f943a00000000000000002) switched from state CREATED to RUNNING.
    2023-07-06 10:23:46,476 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Starting execution of job 
'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' 
(a7d36f3881f943a00000000000000002) under job master id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:46,466 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Using failover strategy 
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@62877000
 for aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
(a7d36f3881f943a00000000000000002).
    2023-07-06 10:23:46,079 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Running initialization on master for job 
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
(a7d36f3881f943a00000000000000002).
    2023-07-06 10:23:46,059 INFO  
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - 
Found 0 checkpoints in 
KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-a7d36f3881f943a00000000000000002-config-map'}.
    2023-07-06 10:23:46,051 INFO  
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - 
Recovering checkpoints from 
KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-a7d36f3881f943a00000000000000002-config-map'}.
    2023-07-06 10:23:46,006 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Using restart back off time strategy 
ExponentialDelayRestartBackoffTimeStrategy(initialBackoffMS=1000, 
maxBackoffMS=300000, backoffMultiplier=2.0, resetBackoffThresholdMS=3600000, 
jitterFactor=0.5, currentBackoffMS=1000, lastFailureTimestamp=0) for 
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
(a7d36f3881f943a00000000000000002).
    2023-07-06 10:23:45,987 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Initializing job 
'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' 
(a7d36f3881f943a00000000000000002).
    2023-07-06 10:23:45,966 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Received 
JobGraph submission 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615' 
(a7d36f3881f943a00000000000000002).
    2023-07-06 10:23:45,965 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Received 
JobGraph submission 'aletsch_mat_e5730831db8092adb12f5189c4c895ef3a268615' 
(a7d36f3881f943a00000000000000002).
    2023-07-06 10:23:45,915 INFO  
org.apache.flink.runtime.jobmanager.DefaultJobGraphStore     [] - Added 
JobGraph(jobId: a7d36f3881f943a00000000000000002) to 
KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-cluster-config-map'}.
    2023-07-06 10:23:45,859 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Submitting 
job 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' 
(a7d36f3881f943a00000000000000002).
    2023-07-06 10:23:45,857 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Received 
JobGraph submission 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' 
(a7d36f3881f943a00000000000000002).
    2023-07-06 10:23:45,705 INFO  
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - 
Submitting Job with JobId=a7d36f3881f943a00000000000000002.
    2023-07-06 10:23:45,705 INFO  
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - 
Job a7d36f3881f943a00000000000000002 is submitted.
    2023-07-06 10:23:45,705 INFO  
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - 
Submitting Job with JobId=a7d36f3881f943a00000000000000002.
    2023-07-06 10:23:45,705 INFO  
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - 
Job a7d36f3881f943a00000000000000002 is submitted.
    2023-07-06 10:23:45,705 INFO  
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - 
Submitting Job with JobId=a7d36f3881f943a00000000000000002.
    2023-07-06 10:23:45,705 INFO  
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - 
Job a7d36f3881f943a00000000000000002 is submitted.

    Flink Operator
    2023-07-06 10:26:25 2023-07-06 08:26:25,792 
o.a.f.k.o.s.AbstractFlinkService [INFO 
][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: 
a7d36f3881f943a00000000000000002 to session cluster.
    2023-07-06 10:25:05,163 o.a.f.k.o.s.AbstractFlinkService [INFO 
][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: 
a7d36f3881f943a00000000000000002 to session cluster.
    2023-07-06 10:24:24,553 o.a.f.k.o.s.AbstractFlinkService [INFO 
][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: 
a7d36f3881f943a00000000000000002 to session cluster.
    2023-07-06 10:24:03,850 o.a.f.k.o.s.AbstractFlinkService [INFO 
][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: 
a7d36f3881f943a00000000000000002 to session cluster.
    2023-07-06 10:23:53,094 o.a.f.k.o.s.AbstractFlinkService [INFO 
][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: 
a7d36f3881f943a00000000000000002 to session cluster.
    2023-07-06 10:23:47,346 o.a.f.k.o.s.AbstractFlinkService [INFO 
][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: 
a7d36f3881f943a00000000000000002 to session cluster.
    2023-07-06 10:23:45,372 o.a.f.k.o.s.AbstractFlinkService [INFO 
][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: 
a7d36f3881f943a00000000000000002 to session cluster.
{code}
 
JobID: a1221c743367497b0000000000000002
{code:bash}
Flink Cluster
    2023-07-06 11:23:48,062 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed 
checkpoint 1 for job a1221c743367497b0000000000000002 (48548 bytes, 
checkpointDuration=107 ms, finalizationTime=33 ms).
    2023-07-06 11:23:47,937 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering 
checkpoint 1 (type=CheckpointType{name='Checkpoint', 
sharingFilesStrategy=FORWARD_BACKWARD}) @ 1688635427922 for job 
a1221c743367497b0000000000000002.
    2023-07-06 10:23:48,567 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Offer 
reserved slots to the leader of job a1221c743367497b0000000000000002.
    2023-07-06 10:23:48,567 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Establish 
JobManager connection for job a1221c743367497b0000000000000002.
    2023-07-06 10:23:48,567 INFO  
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Successful 
registration at job manager 
akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_7 for job 
a1221c743367497b0000000000000002.
    2023-07-06 10:23:48,009 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive slot 
request cae6932e2409d5fece3f6b4636e3c71a for job 
a1221c743367497b0000000000000002 from resource manager with leader id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:48,003 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive slot 
request 8a57f3ecff07d300aebb33f6b3545aed for job 
a1221c743367497b0000000000000002 from resource manager with leader id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:48,003 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive slot 
request 7a4a0cfd16eec4a1cb043cce5f989db0 for job 
a1221c743367497b0000000000000002 from resource manager with leader id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:48,002 INFO  
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Add job 
a1221c743367497b0000000000000002 for job leader monitoring.
    2023-07-06 10:23:48,002 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive slot 
request 92cbc64513fa703e4acf28bbb3088a58 for job 
a1221c743367497b0000000000000002 from resource manager with leader id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:48,999 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job a1221c743367497b0000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=4}]
    2023-07-06 10:23:47,998 INFO  
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - 
Registered job manager 
aaa9331f70b07a195b5f09d57d1b4...@akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_7
 for job a1221c743367497b0000000000000002.
    2023-07-06 10:23:47,953 INFO  
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - 
Registering job manager 
aaa9331f70b07a195b5f09d57d1b4...@akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_7
 for job a1221c743367497b0000000000000002.
    2023-07-06 10:23:47,922 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
(a1221c743367497b0000000000000002) switched from state CREATED to RUNNING.
    2023-07-06 10:23:47,887 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Starting execution of job 
'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' 
(a1221c743367497b0000000000000002) under job master id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:47,887 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Using failover strategy 
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@2222ba4d
 for aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
(a1221c743367497b0000000000000002).
    2023-07-06 10:23:47,880 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Running initialization on master for job 
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
(a1221c743367497b0000000000000002).
    2023-07-06 10:23:47,872 INFO  
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - 
Found 0 checkpoints in 
KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-a1221c743367497b0000000000000002-config-map'}.
    2023-07-06 10:23:47,867 INFO  
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - 
Recovering checkpoints from 
KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-a1221c743367497b0000000000000002-config-map'}.
    2023-07-06 10:23:47,832 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Using restart back off time strategy 
ExponentialDelayRestartBackoffTimeStrategy(initialBackoffMS=1000, 
maxBackoffMS=300000, backoffMultiplier=2.0, resetBackoffThresholdMS=3600000, 
jitterFactor=0.5, currentBackoffMS=1000, lastFailureTimestamp=0) for 
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
(a1221c743367497b0000000000000002).
    2023-07-06 10:23:47,832 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Initializing job 
'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' 
(a1221c743367497b0000000000000002).
    2023-07-06 10:23:47,820 INFO  
org.apache.flink.runtime.jobmanager.DefaultJobGraphStore     [] - Added 
JobGraph(jobId: a1221c743367497b0000000000000002) to 
KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-cluster-config-map'}.
    2023-07-06 10:23:47,780 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Submitting 
job 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' 
(a1221c743367497b0000000000000002).
    2023-07-06 10:23:47,776 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Received 
JobGraph submission 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' 
(a1221c743367497b0000000000000002).
    2023-07-06 10:23:47,668 INFO  
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - 
Submitting Job with JobId=a1221c743367497b0000000000000002.
    2023-07-06 10:23:47,668 INFO  
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - 
Job a1221c743367497b0000000000000002 is submitted.

    Flink Operator
    2023-07-06 10:23:48,007 o.a.f.k.o.s.AbstractFlinkService [INFO 
][aelps-staging/aletsch-smc-staging-e5730831] Submitted job: 
a1221c743367497b0000000000000002 to session cluster.
    2023-07-06 10:23:47,505 o.a.f.k.o.s.AbstractFlinkService [INFO 
][aelps-staging/aletsch-smc-staging-e5730831] Submitting job: 
a1221c743367497b0000000000000002 to session cluster.
    2023-07-06 10:23:45,416 o.a.f.k.o.s.AbstractFlinkService [INFO 
][aelps-staging/aletsch-smc-staging-e5730831] Submitting job: 
a1221c743367497b0000000000000002 to session cluster.
{code}
JobID: e692c2dfaa18441c0000000000000002
{code:bash}
Flink Cluster
    2023-07-06 11:23:48,004 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed 
checkpoint 1 for job e692c2dfaa18441c0000000000000002 (8194 bytes, 
checkpointDuration=125 ms, finalizationTime=28 ms).
    2023-07-06 11:23:47,867 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering 
checkpoint 1 (type=CheckpointType{name='Checkpoint', 
sharingFilesStrategy=FORWARD_BACKWARD}) @ 1688635427851 for job 
e692c2dfaa18441c0000000000000002.
    2023-07-06 10:23:48,568 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Offer 
reserved slots to the leader of job e692c2dfaa18441c0000000000000002.
    2023-07-06 10:23:48,568 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Establish 
JobManager connection for job e692c2dfaa18441c0000000000000002.
    2023-07-06 10:23:48,568 INFO  
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Successful 
registration at job manager 
akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_6 for job 
e692c2dfaa18441c0000000000000002.
    2023-07-06 10:23:48,002 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive slot 
request 5e5a0e55fac280bf31abf29a20bce684 for job 
e692c2dfaa18441c0000000000000002 from resource manager with leader id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:48,002 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive slot 
request 1cdbce54f4376a1df86430f97dab6858 for job 
e692c2dfaa18441c0000000000000002 from resource manager with leader id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:48,002 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive slot 
request 352db7288d0e4d1775d5f52dd14c769d for job 
e692c2dfaa18441c0000000000000002 from resource manager with leader id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:48,001 INFO  
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Add job 
e692c2dfaa18441c0000000000000002 for job leader monitoring.
    2023-07-06 10:23:48,000 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive slot 
request bffed3e4a4c8573049a4119bd7e15f19 for job 
e692c2dfaa18441c0000000000000002 from resource manager with leader id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:48,998 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job e692c2dfaa18441c0000000000000002: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=4}]
    2023-07-06 10:23:47,998 INFO  
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - 
Registered job manager 
aaa9331f70b07a195b5f09d57d1b4...@akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_6
 for job e692c2dfaa18441c0000000000000002.
    2023-07-06 10:23:47,953 INFO  
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - 
Registering job manager 
aaa9331f70b07a195b5f09d57d1b4...@akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_6
 for job e692c2dfaa18441c0000000000000002.
    2023-07-06 10:23:47,851 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615 
(e692c2dfaa18441c0000000000000002) switched from state CREATED to RUNNING.
    2023-07-06 10:23:47,845 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Starting execution of job 
'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615' 
(e692c2dfaa18441c0000000000000002) under job master id 
aaa9331f70b07a195b5f09d57d1b40c5.
    2023-07-06 10:23:47,844 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Using failover strategy 
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@7eeab246
 for aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615 
(e692c2dfaa18441c0000000000000002).
    2023-07-06 10:23:47,834 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Running initialization on master for job 
aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615 
(e692c2dfaa18441c0000000000000002).
    2023-07-06 10:23:47,825 INFO  
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - 
Found 0 checkpoints in 
KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-e692c2dfaa18441c0000000000000002-config-map'}.
    2023-07-06 10:23:47,813 INFO  
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - 
Recovering checkpoints from 
KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-e692c2dfaa18441c0000000000000002-config-map'}.
    2023-07-06 10:23:47,782 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Using restart back off time strategy 
ExponentialDelayRestartBackoffTimeStrategy(initialBackoffMS=1000, 
maxBackoffMS=300000, backoffMultiplier=2.0, resetBackoffThresholdMS=3600000, 
jitterFactor=0.5, currentBackoffMS=1000, lastFailureTimestamp=0) for 
aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615 
(e692c2dfaa18441c0000000000000002).
    2023-07-06 10:23:47,781 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
               [] - Initializing job 
'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615' 
(e692c2dfaa18441c0000000000000002).
    2023-07-06 10:23:47,774 INFO  
org.apache.flink.runtime.jobmanager.DefaultJobGraphStore     [] - Added 
JobGraph(jobId: e692c2dfaa18441c0000000000000002) to 
KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-cluster-config-map'}.
    2023-07-06 10:23:47,703 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Submitting 
job 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615' 
(e692c2dfaa18441c0000000000000002).
    2023-07-06 10:23:47,702 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Received 
JobGraph submission 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615' 
(e692c2dfaa18441c0000000000000002).
    2023-07-06 10:23:47,650 INFO  
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - 
Submitting Job with JobId=e692c2dfaa18441c0000000000000002.
    2023-07-06 10:23:47,650 INFO  
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - 
Job e692c2dfaa18441c0000000000000002 is submitted.

    Flink Operator
    2023-07-06 10:23:47,973 o.a.f.k.o.s.AbstractFlinkService [INFO 
][aelps-staging/aletsch-wp-wafer-staging-e5730831] Submitted job: 
e692c2dfaa18441c0000000000000002 to session cluster.
    2023-07-06 10:23:47,505 o.a.f.k.o.s.AbstractFlinkService [INFO 
][aelps-staging/aletsch-wp-wafer-staging-e5730831] Submitting job: 
e692c2dfaa18441c0000000000000002 to session cluster.
    2023-07-06 10:23:45,374 o.a.f.k.o.s.AbstractFlinkService [INFO 
][aelps-staging/aletsch-wp-wafer-staging-e5730831] Submitting job: 
e692c2dfaa18441c0000000000000002 to session cluster.
{code}


> Mixed up Flink session job deployments
> --------------------------------------
>
>                 Key: FLINK-32552
>                 URL: https://issues.apache.org/jira/browse/FLINK-32552
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>            Reporter: Fabio Wanner
>            Priority: Major
>
> *Context*
> As part of our end-to-end tests, we regularly deploy all of our Flink 
> session jobs to a staging environment. Some of the jobs are bundled together 
> in one helm chart and are therefore deployed at the same time. There are 
> around 40 individual Flink jobs, all running on the same Flink session 
> cluster. Each e2e test run gets its own session cluster. The problems 
> described below occur rarely (roughly 1 in 50 runs).
> *Problem*
> Occasionally the operator seems to "mix up" the deployments. This can be 
> seen in the Flink cluster logs: multiple {{Received JobGraph submission 
> '<JOB NAME>' (<JOB_ID>)}} entries are logged for different jobs with the 
> same job_id. This results in errors such as 
> {{DuplicateJobSubmissionException}} or {{ClassNotFoundException}}.
> It is also visible in the FlinkSessionJob resource: status.jobStatus.jobName 
> does not match the expected job name of the job being deployed (the job name 
> is passed to the application via an argument).
> So far we have been unable to reliably reproduce the error.
> *Details*
> The following lines show the status of 3 jobs from the viewpoint of the 
> Flink cluster dashboard and the FlinkSessionJob resource:
>  
> *aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615*
> Apache Flink Dashboard:
>  * State: Restarting
>  * ID: a7d36f3881f943a00000000000000002
>  * Exceptions: Cannot load user class: 
> aelps.pipelines.aletsch.smc.SMCUrlMapper
> FlinkSessionJob Resource:
>  * State: RUNNING
>  * jobId: a1221c743367497b0000000000000002
>  * uid: a1221c74-3367-497b-ad2f-8793ab23919d
>  
> *aletsch_mat_e5730831db8092adb12f5189c4c895ef3a268615*
> Apache Flink Dashboard:
>  * State: -
>  * ID: -
> FlinkSessionJob Resource:
>  * State: UPGRADING
>  * jobId: -
>  * uid: a7d36f38-81f9-43a0-898f-19b950430e9d
> Flink K8s Operator:
>  * Exceptions: DuplicateJobSubmissionException: Job has already been 
> submitted.
>  
> *aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615*
> Apache Flink Dashboard:
>  * State: Running
>  * ID: e692c2dfaa18441c0000000000000002
>  * Exceptions: -
> FlinkSessionJob Resource:
>  * State: RUNNING
>  * jobId: e692c2dfaa18441c0000000000000002
>  * uid: e692c2df-aa18-441c-a352-88aefa9a3017
> As we can see, the *aletsch_smc* job is reported as running by its 
> FlinkSessionJob resource, but in the cluster it is crash-looping under a 
> jobID that matches the uid of the *aletsch_mat* resource, while 
> *aletsch_mat* itself is not running at all. The following logs also show 
> some suspicious entries: there are several {{Received JobGraph submission}} 
> entries from different jobs with the same jobID.
>  
> *Logs*
> The logs are filtered by the 3 jobIds from above.
>  
> JobID: a7d36f3881f943a00000000000000002
> {code:bash}
> Flink Cluster
>     ...
>     2023-07-06 10:23:50,552 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
> aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
> (a7d36f3881f943a00000000000000002) switched from state RUNNING to RESTARTING.
>     2023-07-06 10:23:50           file: 
> '/tmp/tm_10.0.11.159:6122-e9fadc/blobStorage/job_a7d36f3881f943a00000000000000002/blob_p-40c7a30adef8868254191d2cf2dbc4cb7ab46f0d-8a02a0583d91c5e8e6c94f378aa444c2'
>  (valid JAR)
>     2023-07-06 10:23:50,522 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager 
> [] - Received resource requirements from job 
> a7d36f3881f943a00000000000000002: 
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
> numberOfRequiredSlots=4}]
>     2023-07-06 10:23:50,522 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager 
> [] - Received resource requirements from job 
> a7d36f3881f943a00000000000000002: 
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
> numberOfRequiredSlots=3}]
>     2023-07-06 10:23:50,522 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager 
> [] - Received resource requirements from job 
> a7d36f3881f943a00000000000000002: 
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
> numberOfRequiredSlots=2}]
>     2023-07-06 10:23:50,522 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager 
> [] - Received resource requirements from job 
> a7d36f3881f943a00000000000000002: 
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
> numberOfRequiredSlots=1}]
>     2023-07-06 10:23:50,512 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
> aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
> (a7d36f3881f943a00000000000000002) switched from state RESTARTING to RUNNING.
>     2023-07-06 10:23:48,979 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager 
> [] - Clearing resource requirements of job a7d36f3881f943a00000000000000002
>     2023-07-06 10:23:48,853 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager 
> [] - Received resource requirements from job 
> a7d36f3881f943a00000000000000002: 
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
> numberOfRequiredSlots=1}]
>     2023-07-06 10:23:48,853 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager 
> [] - Received resource requirements from job 
> a7d36f3881f943a00000000000000002: 
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
> numberOfRequiredSlots=2}]
>     2023-07-06 10:23:48,853 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager 
> [] - Received resource requirements from job 
> a7d36f3881f943a00000000000000002: 
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
> numberOfRequiredSlots=3}]
>     2023-07-06 10:23:48           file: 
> '/tmp/tm_10.0.11.159:6122-e9fadc/blobStorage/job_a7d36f3881f943a00000000000000002/blob_p-40c7a30adef8868254191d2cf2dbc4cb7ab46f0d-8a02a0583d91c5e8e6c94f378aa444c2'
>  (valid JAR)
>     2023-07-06 10:23:48,661 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
> aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
> (a7d36f3881f943a00000000000000002) switched from state RUNNING to RESTARTING.
>     2023-07-06 10:23:48,583 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager 
> [] - Received resource requirements from job 
> a7d36f3881f943a00000000000000002: 
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
> numberOfRequiredSlots=4}]
>     2023-07-06 10:23:48,583 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager 
> [] - Received resource requirements from job 
> a7d36f3881f943a00000000000000002: 
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
> numberOfRequiredSlots=3}]
>     2023-07-06 10:23:48,583 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager 
> [] - Received resource requirements from job 
> a7d36f3881f943a00000000000000002: 
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
> numberOfRequiredSlots=2}]
>     2023-07-06 10:23:48,582 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager 
> [] - Received resource requirements from job 
> a7d36f3881f943a00000000000000002: 
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
> numberOfRequiredSlots=1}]
>     2023-07-06 10:23:48,573 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
> aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
> (a7d36f3881f943a00000000000000002) switched from state RESTARTING to RUNNING.
>     2023-07-06 10:23:47,562 INFO  
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Received 
> JobGraph submission 'aletsch_mat_e5730831db8092adb12f5189c4c895ef3a268615' 
> (a7d36f3881f943a00000000000000002).
>     2023-07-06 10:23:47,518 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager 
> [] - Clearing resource requirements of job a7d36f3881f943a00000000000000002
>     2023-07-06 10:23:47,517 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager 
> [] - Received resource requirements from job 
> a7d36f3881f943a00000000000000002: 
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
> numberOfRequiredSlots=1}]
>     2023-07-06 10:23:47,517 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager 
> [] - Received resource requirements from job 
> a7d36f3881f943a00000000000000002: 
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
> numberOfRequiredSlots=2}]
>     2023-07-06 10:23:47,516 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager 
> [] - Received resource requirements from job 
> a7d36f3881f943a00000000000000002: 
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
> numberOfRequiredSlots=3}]
>     2023-07-06 10:23:47,463 INFO  
> org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] 
> - Submitting Job with JobId=a7d36f3881f943a00000000000000002.
>     2023-07-06 10:23:47,463 INFO  
> org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] 
> - Job a7d36f3881f943a00000000000000002 is submitted.
>     2023-07-06 10:23:47,104 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
> aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
> (a7d36f3881f943a00000000000000002) switched from state RUNNING to RESTARTING.
>     2023-07-06 10:23:46,804 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Offer 
> reserved slots to the leader of job a7d36f3881f943a00000000000000002.
>     2023-07-06 10:23:46,804 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Establish 
> JobManager connection for job a7d36f3881f943a00000000000000002.
>     2023-07-06 10:23:46,799 INFO  
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Successful 
> registration at job manager 
> akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_2 for job 
> a7d36f3881f943a00000000000000002.
>     2023-07-06 10:23:46,577 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive 
> slot request 221b24b50413805c9e35d7620b8a00b8 for job 
> a7d36f3881f943a00000000000000002 from resource manager with leader id 
> aaa9331f70b07a195b5f09d57d1b40c5.
>     2023-07-06 10:23:46,577 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive 
> slot request 49d3c8cd1080bd38c0144c3d3cc597cd for job 
> a7d36f3881f943a00000000000000002 from resource manager with leader id 
> aaa9331f70b07a195b5f09d57d1b40c5.
>     2023-07-06 10:23:46,577 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive 
> slot request 819f34cc8957066478fb4b3549367d24 for job 
> a7d36f3881f943a00000000000000002 from resource manager with leader id 
> aaa9331f70b07a195b5f09d57d1b40c5.
>     2023-07-06 10:23:46,574 INFO  
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Add job 
> a7d36f3881f943a00000000000000002 for job leader monitoring.
>     2023-07-06 10:23:46,570 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive 
> slot request 36802a7de1487f3fb1b6a3b509bd5e20 for job 
> a7d36f3881f943a00000000000000002 from resource manager with leader id 
> aaa9331f70b07a195b5f09d57d1b40c5.
>     2023-07-06 10:23:46,560 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager 
> [] - Received resource requirements from job 
> a7d36f3881f943a00000000000000002: 
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
> numberOfRequiredSlots=4}]
>     2023-07-06 10:23:46,556 INFO  
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - 
> Registered job manager 
> aaa9331f70b07a195b5f09d57d1b4...@akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_2
>  for job a7d36f3881f943a00000000000000002.
>     2023-07-06 10:23:46,528 INFO  
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - 
> Registering job manager 
> aaa9331f70b07a195b5f09d57d1b4...@akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_2
>  for job a7d36f3881f943a00000000000000002.
>     2023-07-06 10:23:46,480 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
> aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
> (a7d36f3881f943a00000000000000002) switched from state CREATED to RUNNING.
>     2023-07-06 10:23:46,476 INFO  
> org.apache.flink.runtime.jobmaster.JobMaster                 [] - Starting 
> execution of job 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' 
> (a7d36f3881f943a00000000000000002) under job master id 
> aaa9331f70b07a195b5f09d57d1b40c5.
>     2023-07-06 10:23:46,466 INFO  
> org.apache.flink.runtime.jobmaster.JobMaster                 [] - Using 
> failover strategy 
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@62877000
>  for aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
> (a7d36f3881f943a00000000000000002).
>     2023-07-06 10:23:46,079 INFO  
> org.apache.flink.runtime.jobmaster.JobMaster                 [] - Running 
> initialization on master for job 
> aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
> (a7d36f3881f943a00000000000000002).
>     2023-07-06 10:23:46,059 INFO  
> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - 
> Found 0 checkpoints in 
> KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-a7d36f3881f943a00000000000000002-config-map'}.
>     2023-07-06 10:23:46,051 INFO  
> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - 
> Recovering checkpoints from 
> KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-a7d36f3881f943a00000000000000002-config-map'}.
>     2023-07-06 10:23:46,006 INFO  
> org.apache.flink.runtime.jobmaster.JobMaster                 [] - Using 
> restart back off time strategy 
> ExponentialDelayRestartBackoffTimeStrategy(initialBackoffMS=1000, 
> maxBackoffMS=300000, backoffMultiplier=2.0, resetBackoffThresholdMS=3600000, 
> jitterFactor=0.5, currentBackoffMS=1000, lastFailureTimestamp=0) for 
> aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
> (a7d36f3881f943a00000000000000002).
>     2023-07-06 10:23:45,987 INFO  
> org.apache.flink.runtime.jobmaster.JobMaster                 [] - 
> Initializing job 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' 
> (a7d36f3881f943a00000000000000002).
>     2023-07-06 10:23:45,966 INFO  
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Received 
> JobGraph submission 
> 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615' 
> (a7d36f3881f943a00000000000000002).
>     2023-07-06 10:23:45,965 INFO  
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Received 
> JobGraph submission 'aletsch_mat_e5730831db8092adb12f5189c4c895ef3a268615' 
> (a7d36f3881f943a00000000000000002).
>     2023-07-06 10:23:45,915 INFO  
> org.apache.flink.runtime.jobmanager.DefaultJobGraphStore     [] - Added 
> JobGraph(jobId: a7d36f3881f943a00000000000000002) to 
> KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-cluster-config-map'}.
>     2023-07-06 10:23:45,859 INFO  
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Submitting 
> job 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' 
> (a7d36f3881f943a00000000000000002).
>     2023-07-06 10:23:45,857 INFO  
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Received 
> JobGraph submission 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' 
> (a7d36f3881f943a00000000000000002).
>     2023-07-06 10:23:45,705 INFO  
> org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] 
> - Submitting Job with JobId=a7d36f3881f943a00000000000000002.
>     2023-07-06 10:23:45,705 INFO  
> org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] 
> - Job a7d36f3881f943a00000000000000002 is submitted.
>     2023-07-06 10:23:45,705 INFO  
> org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] 
> - Submitting Job with JobId=a7d36f3881f943a00000000000000002.
>     2023-07-06 10:23:45,705 INFO  
> org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] 
> - Job a7d36f3881f943a00000000000000002 is submitted.
>     2023-07-06 10:23:45,705 INFO  
> org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] 
> - Submitting Job with JobId=a7d36f3881f943a00000000000000002.
>     2023-07-06 10:23:45,705 INFO  
> org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] 
> - Job a7d36f3881f943a00000000000000002 is submitted.
>     Flink Operator
>     2023-07-06 10:26:25,792 o.a.f.k.o.s.AbstractFlinkService [INFO 
> ][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: 
> a7d36f3881f943a00000000000000002 to session cluster.
>     2023-07-06 10:25:05,163 o.a.f.k.o.s.AbstractFlinkService [INFO 
> ][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: 
> a7d36f3881f943a00000000000000002 to session cluster.
>     2023-07-06 10:24:24,553 o.a.f.k.o.s.AbstractFlinkService [INFO 
> ][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: 
> a7d36f3881f943a00000000000000002 to session cluster.
>     2023-07-06 10:24:03,850 o.a.f.k.o.s.AbstractFlinkService [INFO 
> ][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: 
> a7d36f3881f943a00000000000000002 to session cluster.
>     2023-07-06 10:23:53,094 o.a.f.k.o.s.AbstractFlinkService [INFO 
> ][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: 
> a7d36f3881f943a00000000000000002 to session cluster.
>     2023-07-06 10:23:47,346 o.a.f.k.o.s.AbstractFlinkService [INFO 
> ][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: 
> a7d36f3881f943a00000000000000002 to session cluster.
>     2023-07-06 10:23:45,372 o.a.f.k.o.s.AbstractFlinkService [INFO 
> ][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: 
> a7d36f3881f943a00000000000000002 to session cluster.
> {code}
>  
> JobID: a1221c743367497b0000000000000002
> {code:bash}
> Flink Cluster
>     2023-07-06 11:23:48,062 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed 
> checkpoint 1 for job a1221c743367497b0000000000000002 (48548 bytes, 
> checkpointDuration=107 ms, finalizationTime=33 ms).
>     2023-07-06 11:23:47,937 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering 
> checkpoint 1 (type=CheckpointType{name='Checkpoint', 
> sharingFilesStrategy=FORWARD_BACKWARD}) @ 1688635427922 for job 
> a1221c743367497b0000000000000002.
>     2023-07-06 10:23:48,567 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Offer 
> reserved slots to the leader of job a1221c743367497b0000000000000002.
>     2023-07-06 10:23:48,567 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Establish 
> JobManager connection for job a1221c743367497b0000000000000002.
>     2023-07-06 10:23:48,567 INFO  
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Successful 
> registration at job manager 
> akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_7 for job 
> a1221c743367497b0000000000000002.
>     2023-07-06 10:23:48,009 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive 
> slot request cae6932e2409d5fece3f6b4636e3c71a for job 
> a1221c743367497b0000000000000002 from resource manager with leader id 
> aaa9331f70b07a195b5f09d57d1b40c5.
>     2023-07-06 10:23:48,003 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive 
> slot request 8a57f3ecff07d300aebb33f6b3545aed for job 
> a1221c743367497b0000000000000002 from resource manager with leader id 
> aaa9331f70b07a195b5f09d57d1b40c5.
>     2023-07-06 10:23:48,003 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive 
> slot request 7a4a0cfd16eec4a1cb043cce5f989db0 for job 
> a1221c743367497b0000000000000002 from resource manager with leader id 
> aaa9331f70b07a195b5f09d57d1b40c5.
>     2023-07-06 10:23:48,002 INFO  
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Add job 
> a1221c743367497b0000000000000002 for job leader monitoring.
>     2023-07-06 10:23:48,002 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive 
> slot request 92cbc64513fa703e4acf28bbb3088a58 for job 
> a1221c743367497b0000000000000002 from resource manager with leader id 
> aaa9331f70b07a195b5f09d57d1b40c5.
>     2023-07-06 10:23:48,999 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager 
> [] - Received resource requirements from job 
> a1221c743367497b0000000000000002: 
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
> numberOfRequiredSlots=4}]
>     2023-07-06 10:23:47,998 INFO  
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - 
> Registered job manager 
> aaa9331f70b07a195b5f09d57d1b4...@akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_7
>  for job a1221c743367497b0000000000000002.
>     2023-07-06 10:23:47,953 INFO  
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - 
> Registering job manager 
> aaa9331f70b07a195b5f09d57d1b4...@akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_7
>  for job a1221c743367497b0000000000000002.
>     2023-07-06 10:23:47,922 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
> aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
> (a1221c743367497b0000000000000002) switched from state CREATED to RUNNING.
>     2023-07-06 10:23:47,887 INFO  
> org.apache.flink.runtime.jobmaster.JobMaster                 [] - Starting 
> execution of job 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' 
> (a1221c743367497b0000000000000002) under job master id 
> aaa9331f70b07a195b5f09d57d1b40c5.
>     2023-07-06 10:23:47,887 INFO  
> org.apache.flink.runtime.jobmaster.JobMaster                 [] - Using 
> failover strategy 
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@2222ba4d
>  for aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
> (a1221c743367497b0000000000000002).
>     2023-07-06 10:23:47,880 INFO  
> org.apache.flink.runtime.jobmaster.JobMaster                 [] - Running 
> initialization on master for job 
> aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
> (a1221c743367497b0000000000000002).
>     2023-07-06 10:23:47,872 INFO  
> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - 
> Found 0 checkpoints in 
> KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-a1221c743367497b0000000000000002-config-map'}.
>     2023-07-06 10:23:47,867 INFO  
> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - 
> Recovering checkpoints from 
> KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-a1221c743367497b0000000000000002-config-map'}.
>     2023-07-06 10:23:47,832 INFO  
> org.apache.flink.runtime.jobmaster.JobMaster                 [] - Using 
> restart back off time strategy 
> ExponentialDelayRestartBackoffTimeStrategy(initialBackoffMS=1000, 
> maxBackoffMS=300000, backoffMultiplier=2.0, resetBackoffThresholdMS=3600000, 
> jitterFactor=0.5, currentBackoffMS=1000, lastFailureTimestamp=0) for 
> aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 
> (a1221c743367497b0000000000000002).
>     2023-07-06 10:23:47,832 INFO  
> org.apache.flink.runtime.jobmaster.JobMaster                 [] - 
> Initializing job 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' 
> (a1221c743367497b0000000000000002).
>     2023-07-06 10:23:47,820 INFO  
> org.apache.flink.runtime.jobmanager.DefaultJobGraphStore     [] - Added 
> JobGraph(jobId: a1221c743367497b0000000000000002) to 
> KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-cluster-config-map'}.
>     2023-07-06 10:23:47,780 INFO  
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Submitting 
> job 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' 
> (a1221c743367497b0000000000000002).
>     2023-07-06 10:23:47,776 INFO  
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Received 
> JobGraph submission 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' 
> (a1221c743367497b0000000000000002).
>     2023-07-06 10:23:47,668 INFO  
> org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] 
> - Submitting Job with JobId=a1221c743367497b0000000000000002.
>     2023-07-06 10:23:47,668 INFO  
> org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] 
> - Job a1221c743367497b0000000000000002 is submitted.
>     Flink Operator
>     2023-07-06 10:23:48,007 o.a.f.k.o.s.AbstractFlinkService [INFO 
> ][aelps-staging/aletsch-smc-staging-e5730831] Submitted job: 
> a1221c743367497b0000000000000002 to session cluster.
>     2023-07-06 10:23:47,505 o.a.f.k.o.s.AbstractFlinkService [INFO 
> ][aelps-staging/aletsch-smc-staging-e5730831] Submitting job: 
> a1221c743367497b0000000000000002 to session cluster.
>     2023-07-06 10:23:45,416 o.a.f.k.o.s.AbstractFlinkService [INFO 
> ][aelps-staging/aletsch-smc-staging-e5730831] Submitting job: 
> a1221c743367497b0000000000000002 to session cluster.
> {code}
> JobID: e692c2dfaa18441c0000000000000002
> {code:bash}
> Flink Cluster
>     2023-07-06 11:23:48,004 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed 
> checkpoint 1 for job e692c2dfaa18441c0000000000000002 (8194 bytes, 
> checkpointDuration=125 ms, finalizationTime=28 ms).
>     2023-07-06 11:23:47,867 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering 
> checkpoint 1 (type=CheckpointType{name='Checkpoint', 
> sharingFilesStrategy=FORWARD_BACKWARD}) @ 1688635427851 for job 
> e692c2dfaa18441c0000000000000002.
>     2023-07-06 10:23:48,568 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Offer 
> reserved slots to the leader of job e692c2dfaa18441c0000000000000002.
>     2023-07-06 10:23:48,568 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Establish 
> JobManager connection for job e692c2dfaa18441c0000000000000002.
>     2023-07-06 10:23:48,568 INFO  
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Successful 
> registration at job manager 
> akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_6 for job 
> e692c2dfaa18441c0000000000000002.
>     2023-07-06 10:23:48,002 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive 
> slot request 5e5a0e55fac280bf31abf29a20bce684 for job 
> e692c2dfaa18441c0000000000000002 from resource manager with leader id 
> aaa9331f70b07a195b5f09d57d1b40c5.
>     2023-07-06 10:23:48,002 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive 
> slot request 1cdbce54f4376a1df86430f97dab6858 for job 
> e692c2dfaa18441c0000000000000002 from resource manager with leader id 
> aaa9331f70b07a195b5f09d57d1b40c5.
>     2023-07-06 10:23:48,002 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive 
> slot request 352db7288d0e4d1775d5f52dd14c769d for job 
> e692c2dfaa18441c0000000000000002 from resource manager with leader id 
> aaa9331f70b07a195b5f09d57d1b40c5.
>     2023-07-06 10:23:48,001 INFO  
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Add job 
> e692c2dfaa18441c0000000000000002 for job leader monitoring.
>     2023-07-06 10:23:48,000 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Receive 
> slot request bffed3e4a4c8573049a4119bd7e15f19 for job 
> e692c2dfaa18441c0000000000000002 from resource manager with leader id 
> aaa9331f70b07a195b5f09d57d1b40c5.
>     2023-07-06 10:23:48,998 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager 
> [] - Received resource requirements from job 
> e692c2dfaa18441c0000000000000002: 
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
> numberOfRequiredSlots=4}]
>     2023-07-06 10:23:47,998 INFO  
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - 
> Registered job manager 
> aaa9331f70b07a195b5f09d57d1b4...@akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_6
>  for job e692c2dfaa18441c0000000000000002.
>     2023-07-06 10:23:47,953 INFO  
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - 
> Registering job manager 
> aaa9331f70b07a195b5f09d57d1b4...@akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_6
>  for job e692c2dfaa18441c0000000000000002.
>     2023-07-06 10:23:47,851 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
> aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615 
> (e692c2dfaa18441c0000000000000002) switched from state CREATED to RUNNING.
>     2023-07-06 10:23:47,845 INFO  
> org.apache.flink.runtime.jobmaster.JobMaster                 [] - Starting 
> execution of job 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615' 
> (e692c2dfaa18441c0000000000000002) under job master id 
> aaa9331f70b07a195b5f09d57d1b40c5.
>     2023-07-06 10:23:47,844 INFO  
> org.apache.flink.runtime.jobmaster.JobMaster                 [] - Using 
> failover strategy 
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@7eeab246
>  for aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615 
> (e692c2dfaa18441c0000000000000002).
>     2023-07-06 10:23:47,834 INFO  
> org.apache.flink.runtime.jobmaster.JobMaster                 [] - Running 
> initialization on master for job 
> aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615 
> (e692c2dfaa18441c0000000000000002).
>     2023-07-06 10:23:47,825 INFO  
> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - 
> Found 0 checkpoints in 
> KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-e692c2dfaa18441c0000000000000002-config-map'}.
>     2023-07-06 10:23:47,813 INFO  
> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - 
> Recovering checkpoints from 
> KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-e692c2dfaa18441c0000000000000002-config-map'}.
>     2023-07-06 10:23:47,782 INFO  
> org.apache.flink.runtime.jobmaster.JobMaster                 [] - Using 
> restart back off time strategy 
> ExponentialDelayRestartBackoffTimeStrategy(initialBackoffMS=1000, 
> maxBackoffMS=300000, backoffMultiplier=2.0, resetBackoffThresholdMS=3600000, 
> jitterFactor=0.5, currentBackoffMS=1000, lastFailureTimestamp=0) for 
> aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615 
> (e692c2dfaa18441c0000000000000002).
>     2023-07-06 10:23:47,781 INFO  
> org.apache.flink.runtime.jobmaster.JobMaster                 [] - 
> Initializing job 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615' 
> (e692c2dfaa18441c0000000000000002).
>     2023-07-06 10:23:47,774 INFO  
> org.apache.flink.runtime.jobmanager.DefaultJobGraphStore     [] - Added 
> JobGraph(jobId: e692c2dfaa18441c0000000000000002) to 
> KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-cluster-config-map'}.
>     2023-07-06 10:23:47,703 INFO  
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Submitting 
> job 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615' 
> (e692c2dfaa18441c0000000000000002).
>     2023-07-06 10:23:47,702 INFO  
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Received 
> JobGraph submission 
> 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615' 
> (e692c2dfaa18441c0000000000000002).
>     2023-07-06 10:23:47,650 INFO  
> org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] 
> - Submitting Job with JobId=e692c2dfaa18441c0000000000000002.
>     2023-07-06 10:23:47,650 INFO  
> org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] 
> - Job e692c2dfaa18441c0000000000000002 is submitted.
>     Flink Operator
>     2023-07-06 10:23:47,973 o.a.f.k.o.s.AbstractFlinkService [INFO 
> ][aelps-staging/aletsch-wp-wafer-staging-e5730831] Submitted job: 
> e692c2dfaa18441c0000000000000002 to session cluster.
>     2023-07-06 10:23:47,505 o.a.f.k.o.s.AbstractFlinkService [INFO 
> ][aelps-staging/aletsch-wp-wafer-staging-e5730831] Submitting job: 
> e692c2dfaa18441c0000000000000002 to session cluster.
>     2023-07-06 10:23:45,374 o.a.f.k.o.s.AbstractFlinkService [INFO 
> ][aelps-staging/aletsch-wp-wafer-staging-e5730831] Submitting job: 
> e692c2dfaa18441c0000000000000002 to session cluster.
> {code}
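
The IDs listed in the details above suggest a simple relation between a FlinkSessionJob's uid and its jobId: in all three examples the first 16 hex digits of the jobId equal the first 16 hex digits of the uid with the hyphens removed, followed by the suffix 0000000000000002. The sketch below (Python, hypothetical helper names) only encodes that observation from this report; it is an assumption about the pattern, not a statement of how the operator computes the ID, and the meaning of the suffix is not given here. It can be used to look up which resource a job ID seen in the dashboard appears to belong to.

```python
def job_id_prefix(uid):
    """First 16 hex digits of a Kubernetes uid, hyphens removed.

    Assumption drawn from the three examples in this report: the jobId
    starts with these digits.
    """
    return uid.replace("-", "")[:16]


def owners_of(job_id, uids_by_resource):
    """Names of the resources whose uid prefix matches the given job ID."""
    return [
        name
        for name, uid in uids_by_resource.items()
        if job_id.startswith(job_id_prefix(uid))
    ]


# uids as listed for the three FlinkSessionJob resources in this report.
UIDS = {
    "aletsch_smc": "a1221c74-3367-497b-ad2f-8793ab23919d",
    "aletsch_mat": "a7d36f38-81f9-43a0-898f-19b950430e9d",
    "aletsch_wp_wafer": "e692c2df-aa18-441c-a352-88aefa9a3017",
}

if __name__ == "__main__":
    # Job ID shown in the Flink dashboard for the crash-looping aletsch_smc job.
    dashboard_id = "a7d36f3881f943a00000000000000002"
    print(dashboard_id, "matches the uid of:", owners_of(dashboard_id, UIDS))
```

Applied to the dashboard ID of the crash-looping *aletsch_smc* job, the lookup points at the *aletsch_mat* resource, which is the mismatch described in the report.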



--
This message was sent by Atlassian Jira
(v8.20.10#820010)