[ https://issues.apache.org/jira/browse/FLINK-32552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Fabio Wanner resolved FLINK-32552.
----------------------------------
    Release Note: Not a bug of the flink k8s operator.
      Resolution: Not A Bug

> Mixed up Flink session job deployments
> --------------------------------------
>
>                 Key: FLINK-32552
>                 URL: https://issues.apache.org/jira/browse/FLINK-32552
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>            Reporter: Fabio Wanner
>            Priority: Major
>
> *Context*
> As part of our end-to-end tests, we regularly deploy all of our Flink session
> jobs to a staging environment. Some of the jobs are bundled together in one
> Helm chart and are therefore deployed at the same time. There are around 40
> individual Flink jobs (running on the same Flink session cluster). Each e2e
> test run gets its own session cluster. The problems described below occur
> rarely (in roughly 1 of 50 runs).
> *Problem*
> Occasionally the operator seems to "mix up" the deployments. This is visible
> in the Flink cluster logs, where multiple {{Received JobGraph submission
> '<JOB NAME>' (<JOB_ID>)}} entries appear for different jobs with the same
> job ID. This results in errors such as {{DuplicateJobSubmissionException}}
> or {{ClassNotFoundException}}.
> It's also visible in the FlinkSessionJob resource: status.jobStatus.jobName
> does not match the expected job name of the job being deployed (the job name
> is passed to the application via an argument).
> So far we have been unable to reproduce the error reliably.
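The symptom described under Problem (several job names submitted under one job ID) can be checked mechanically in a JobManager log. The following is a minimal sketch, not part of the original report: the function name, the sample log lines, and the job names job_a, job_b, and job_c are hypothetical, and the pattern assumes each dispatcher entry sits on a single, unwrapped line.

```python
import re
from collections import defaultdict

# Matches dispatcher entries of the form quoted in the report:
#   Received JobGraph submission '<JOB NAME>' (<JOB_ID>)
# Assumption: the job ID is printed as 32 lowercase hex digits.
SUBMISSION = re.compile(
    r"Received JobGraph submission '([^']+)' \(([0-9a-f]{32})\)"
)


def find_mixed_up_job_ids(log_text):
    """Return {job_id: sorted job names} for IDs submitted under more than one name."""
    names_by_id = defaultdict(set)
    for name, job_id in SUBMISSION.findall(log_text):
        names_by_id[job_id].add(name)
    return {
        job_id: sorted(names)
        for job_id, names in names_by_id.items()
        if len(names) > 1
    }


# Hypothetical, shortened log lines used only for illustration.
SAMPLE = """\
10:23:45,857 INFO StandaloneDispatcher [] - Received JobGraph submission 'job_a' (a7d36f3881f943a00000000000000002).
10:23:45,965 INFO StandaloneDispatcher [] - Received JobGraph submission 'job_b' (a7d36f3881f943a00000000000000002).
10:23:47,702 INFO StandaloneDispatcher [] - Received JobGraph submission 'job_c' (e692c2dfaa18441c0000000000000002).
"""

if __name__ == "__main__":
    for job_id, names in find_mixed_up_job_ids(SAMPLE).items():
        print(job_id, "was submitted as", ", ".join(names))
```

Running such a check after each e2e deployment could help catch the rare affected runs without reading the logs by hand.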
> *Details*
> The following lines show the status of 3 jobs from the viewpoint of the
> Flink cluster dashboard and the FlinkSessionJob resource:
>
> *aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615*
> Apache Flink Dashboard:
> * State: Restarting
> * ID: a7d36f3881f943a00000000000000002
> * Exceptions: Cannot load user class: aelps.pipelines.aletsch.smc.SMCUrlMapper
> FlinkSessionJob Resource:
> * State: RUNNING
> * jobId: a1221c743367497b0000000000000002
> * uid: a1221c74-3367-497b-ad2f-8793ab23919d
>
> *aletsch_mat_e5730831db8092adb12f5189c4c895ef3a268615*
> Apache Flink Dashboard:
> * State: -
> * ID: -
> FlinkSessionJob Resource:
> * State: UPGRADING
> * jobId: -
> * uid: a7d36f38-81f9-43a0-898f-19b950430e9d
> Flink K8s Operator:
> * Exceptions: DuplicateJobSubmissionException: Job has already been submitted.
>
> *aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615*
> Apache Flink Dashboard:
> * State: Running
> * ID: e692c2dfaa18441c0000000000000002
> * Exceptions: -
> FlinkSessionJob Resource:
> * State: RUNNING
> * jobId: e692c2dfaa18441c0000000000000002
> * uid: e692c2df-aa18-441c-a352-88aefa9a3017
> As we can see, the *aletsch_smc* job is supposedly running according to its
> FlinkSessionJob resource, but in the cluster it is crash-looping under a job
> ID that matches the uid of the {*}aletsch_mat{*} resource, while
> *aletsch_mat* itself is not running at all. The following logs also show
> some suspicious entries: there are several {{Received JobGraph submission}}
> entries from different jobs with the same job ID.
>
> *Logs*
> The logs are filtered by the 3 jobIds from above.
>
> JobID: a7d36f3881f943a00000000000000002
> {code:bash}
> Flink Cluster
> ...
> 2023-07-06 10:23:50,552 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 (a7d36f3881f943a00000000000000002) switched from state RUNNING to RESTARTING.
> 2023-07-06 10:23:50 file: > '/tmp/tm_10.0.11.159:6122-e9fadc/blobStorage/job_a7d36f3881f943a00000000000000002/blob_p-40c7a30adef8868254191d2cf2dbc4cb7ab46f0d-8a02a0583d91c5e8e6c94f378aa444c2' > (valid JAR) > 2023-07-06 10:23:50,522 INFO > org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager > [] - Received resource requirements from job > a7d36f3881f943a00000000000000002: > [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, > numberOfRequiredSlots=4}] > 2023-07-06 10:23:50,522 INFO > org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager > [] - Received resource requirements from job > a7d36f3881f943a00000000000000002: > [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, > numberOfRequiredSlots=3}] > 2023-07-06 10:23:50,522 INFO > org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager > [] - Received resource requirements from job > a7d36f3881f943a00000000000000002: > [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, > numberOfRequiredSlots=2}] > 2023-07-06 10:23:50,522 INFO > org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager > [] - Received resource requirements from job > a7d36f3881f943a00000000000000002: > [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, > numberOfRequiredSlots=1}] > 2023-07-06 10:23:50,512 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job > aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 > (a7d36f3881f943a00000000000000002) switched from state RESTARTING to RUNNING. 
> 2023-07-06 10:23:48,979 INFO > org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager > [] - Clearing resource requirements of job a7d36f3881f943a00000000000000002 > 2023-07-06 10:23:48,853 INFO > org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager > [] - Received resource requirements from job > a7d36f3881f943a00000000000000002: > [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, > numberOfRequiredSlots=1}] > 2023-07-06 10:23:48,853 INFO > org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager > [] - Received resource requirements from job > a7d36f3881f943a00000000000000002: > [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, > numberOfRequiredSlots=2}] > 2023-07-06 10:23:48,853 INFO > org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager > [] - Received resource requirements from job > a7d36f3881f943a00000000000000002: > [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, > numberOfRequiredSlots=3}] > 2023-07-06 10:23:48 file: > '/tmp/tm_10.0.11.159:6122-e9fadc/blobStorage/job_a7d36f3881f943a00000000000000002/blob_p-40c7a30adef8868254191d2cf2dbc4cb7ab46f0d-8a02a0583d91c5e8e6c94f378aa444c2' > (valid JAR) > 2023-07-06 10:23:48,661 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job > aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 > (a7d36f3881f943a00000000000000002) switched from state RUNNING to RESTARTING. 
> 2023-07-06 10:23:48,583 INFO > org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager > [] - Received resource requirements from job > a7d36f3881f943a00000000000000002: > [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, > numberOfRequiredSlots=4}] > 2023-07-06 10:23:48,583 INFO > org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager > [] - Received resource requirements from job > a7d36f3881f943a00000000000000002: > [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, > numberOfRequiredSlots=3}] > 2023-07-06 10:23:48,583 INFO > org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager > [] - Received resource requirements from job > a7d36f3881f943a00000000000000002: > [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, > numberOfRequiredSlots=2}] > 2023-07-06 10:23:48,582 INFO > org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager > [] - Received resource requirements from job > a7d36f3881f943a00000000000000002: > [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, > numberOfRequiredSlots=1}] > 2023-07-06 10:23:48,573 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job > aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 > (a7d36f3881f943a00000000000000002) switched from state RESTARTING to RUNNING. > 2023-07-06 10:23:47,562 INFO > org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Received > JobGraph submission 'aletsch_mat_e5730831db8092adb12f5189c4c895ef3a268615' > (a7d36f3881f943a00000000000000002). 
> 2023-07-06 10:23:47,518 INFO > org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager > [] - Clearing resource requirements of job a7d36f3881f943a00000000000000002 > 2023-07-06 10:23:47,517 INFO > org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager > [] - Received resource requirements from job > a7d36f3881f943a00000000000000002: > [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, > numberOfRequiredSlots=1}] > 2023-07-06 10:23:47,517 INFO > org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager > [] - Received resource requirements from job > a7d36f3881f943a00000000000000002: > [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, > numberOfRequiredSlots=2}] > 2023-07-06 10:23:47,516 INFO > org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager > [] - Received resource requirements from job > a7d36f3881f943a00000000000000002: > [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, > numberOfRequiredSlots=3}] > 2023-07-06 10:23:47,463 INFO > org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] > - Submitting Job with JobId=a7d36f3881f943a00000000000000002. > 2023-07-06 10:23:47,463 INFO > org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] > - Job a7d36f3881f943a00000000000000002 is submitted. > 2023-07-06 10:23:47,104 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job > aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 > (a7d36f3881f943a00000000000000002) switched from state RUNNING to RESTARTING. > 2023-07-06 10:23:46,804 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Offer > reserved slots to the leader of job a7d36f3881f943a00000000000000002. > 2023-07-06 10:23:46,804 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Establish > JobManager connection for job a7d36f3881f943a00000000000000002. 
> 2023-07-06 10:23:46,799 INFO > org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Successful > registration at job manager > akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_2 for job > a7d36f3881f943a00000000000000002. > 2023-07-06 10:23:46,577 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive > slot request 221b24b50413805c9e35d7620b8a00b8 for job > a7d36f3881f943a00000000000000002 from resource manager with leader id > aaa9331f70b07a195b5f09d57d1b40c5. > 2023-07-06 10:23:46,577 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive > slot request 49d3c8cd1080bd38c0144c3d3cc597cd for job > a7d36f3881f943a00000000000000002 from resource manager with leader id > aaa9331f70b07a195b5f09d57d1b40c5. > 2023-07-06 10:23:46,577 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive > slot request 819f34cc8957066478fb4b3549367d24 for job > a7d36f3881f943a00000000000000002 from resource manager with leader id > aaa9331f70b07a195b5f09d57d1b40c5. > 2023-07-06 10:23:46,574 INFO > org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Add job > a7d36f3881f943a00000000000000002 for job leader monitoring. > 2023-07-06 10:23:46,570 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive > slot request 36802a7de1487f3fb1b6a3b509bd5e20 for job > a7d36f3881f943a00000000000000002 from resource manager with leader id > aaa9331f70b07a195b5f09d57d1b40c5. 
> 2023-07-06 10:23:46,560 INFO > org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager > [] - Received resource requirements from job > a7d36f3881f943a00000000000000002: > [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, > numberOfRequiredSlots=4}] > 2023-07-06 10:23:46,556 INFO > org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - > Registered job manager > aaa9331f70b07a195b5f09d57d1b4...@akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_2 > for job a7d36f3881f943a00000000000000002. > 2023-07-06 10:23:46,528 INFO > org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - > Registering job manager > aaa9331f70b07a195b5f09d57d1b4...@akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_2 > for job a7d36f3881f943a00000000000000002. > 2023-07-06 10:23:46,480 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job > aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 > (a7d36f3881f943a00000000000000002) switched from state CREATED to RUNNING. > 2023-07-06 10:23:46,476 INFO > org.apache.flink.runtime.jobmaster.JobMaster [] - Starting > execution of job 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' > (a7d36f3881f943a00000000000000002) under job master id > aaa9331f70b07a195b5f09d57d1b40c5. > 2023-07-06 10:23:46,466 INFO > org.apache.flink.runtime.jobmaster.JobMaster [] - Using > failover strategy > org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@62877000 > for aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 > (a7d36f3881f943a00000000000000002). > 2023-07-06 10:23:46,079 INFO > org.apache.flink.runtime.jobmaster.JobMaster [] - Running > initialization on master for job > aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 > (a7d36f3881f943a00000000000000002). 
> 2023-07-06 10:23:46,059 INFO > org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - > Found 0 checkpoints in > KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-a7d36f3881f943a00000000000000002-config-map'}. > 2023-07-06 10:23:46,051 INFO > org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - > Recovering checkpoints from > KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-a7d36f3881f943a00000000000000002-config-map'}. > 2023-07-06 10:23:46,006 INFO > org.apache.flink.runtime.jobmaster.JobMaster [] - Using > restart back off time strategy > ExponentialDelayRestartBackoffTimeStrategy(initialBackoffMS=1000, > maxBackoffMS=300000, backoffMultiplier=2.0, resetBackoffThresholdMS=3600000, > jitterFactor=0.5, currentBackoffMS=1000, lastFailureTimestamp=0) for > aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 > (a7d36f3881f943a00000000000000002). > 2023-07-06 10:23:45,987 INFO > org.apache.flink.runtime.jobmaster.JobMaster [] - > Initializing job 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' > (a7d36f3881f943a00000000000000002). > 2023-07-06 10:23:45,966 INFO > org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Received > JobGraph submission > 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615' > (a7d36f3881f943a00000000000000002). > 2023-07-06 10:23:45,965 INFO > org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Received > JobGraph submission 'aletsch_mat_e5730831db8092adb12f5189c4c895ef3a268615' > (a7d36f3881f943a00000000000000002). > 2023-07-06 10:23:45,915 INFO > org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Added > JobGraph(jobId: a7d36f3881f943a00000000000000002) to > KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-cluster-config-map'}. 
> 2023-07-06 10:23:45,859 INFO > org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Submitting > job 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' > (a7d36f3881f943a00000000000000002). > 2023-07-06 10:23:45,857 INFO > org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Received > JobGraph submission 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' > (a7d36f3881f943a00000000000000002). > 2023-07-06 10:23:45,705 INFO > org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] > - Submitting Job with JobId=a7d36f3881f943a00000000000000002. > 2023-07-06 10:23:45,705 INFO > org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] > - Job a7d36f3881f943a00000000000000002 is submitted. > 2023-07-06 10:23:45,705 INFO > org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] > - Submitting Job with JobId=a7d36f3881f943a00000000000000002. > 2023-07-06 10:23:45,705 INFO > org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] > - Job a7d36f3881f943a00000000000000002 is submitted. > 2023-07-06 10:23:45,705 INFO > org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] > - Submitting Job with JobId=a7d36f3881f943a00000000000000002. > 2023-07-06 10:23:45,705 INFO > org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] > - Job a7d36f3881f943a00000000000000002 is submitted. > Flink Operator > 2023-07-06 10:26:25,792 o.a.f.k.o.s.AbstractFlinkService [INFO > ][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: > a7d36f3881f943a00000000000000002 to session cluster. > 2023-07-06 10:25:05,163 o.a.f.k.o.s.AbstractFlinkService [INFO > ][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: > a7d36f3881f943a00000000000000002 to session cluster. 
> 2023-07-06 10:24:24,553 o.a.f.k.o.s.AbstractFlinkService [INFO > ][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: > a7d36f3881f943a00000000000000002 to session cluster. > 2023-07-06 10:24:03,850 o.a.f.k.o.s.AbstractFlinkService [INFO > ][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: > a7d36f3881f943a00000000000000002 to session cluster. > 2023-07-06 10:23:53,094 o.a.f.k.o.s.AbstractFlinkService [INFO > ][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: > a7d36f3881f943a00000000000000002 to session cluster. > 2023-07-06 10:23:47,346 o.a.f.k.o.s.AbstractFlinkService [INFO > ][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: > a7d36f3881f943a00000000000000002 to session cluster. > 2023-07-06 10:23:45,372 o.a.f.k.o.s.AbstractFlinkService [INFO > ][aelps-staging/aletsch-mat-staging-e5730831] Submitting job: > a7d36f3881f943a00000000000000002 to session cluster. > {code} > > JobID: a1221c743367497b0000000000000002 > {code:bash} > Flink Cluster > 2023-07-06 11:23:48,062 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed > checkpoint 1 for job a1221c743367497b0000000000000002 (48548 bytes, > checkpointDuration=107 ms, finalizationTime=33 ms). > 2023-07-06 11:23:47,937 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering > checkpoint 1 (type=CheckpointType{name='Checkpoint', > sharingFilesStrategy=FORWARD_BACKWARD}) @ 1688635427922 for job > a1221c743367497b0000000000000002. > 2023-07-06 10:23:48,567 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Offer > reserved slots to the leader of job a1221c743367497b0000000000000002. > 2023-07-06 10:23:48,567 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Establish > JobManager connection for job a1221c743367497b0000000000000002. 
> 2023-07-06 10:23:48,567 INFO > org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Successful > registration at job manager > akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_7 for job > a1221c743367497b0000000000000002. > 2023-07-06 10:23:48,009 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive > slot request cae6932e2409d5fece3f6b4636e3c71a for job > a1221c743367497b0000000000000002 from resource manager with leader id > aaa9331f70b07a195b5f09d57d1b40c5. > 2023-07-06 10:23:48,003 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive > slot request 8a57f3ecff07d300aebb33f6b3545aed for job > a1221c743367497b0000000000000002 from resource manager with leader id > aaa9331f70b07a195b5f09d57d1b40c5. > 2023-07-06 10:23:48,003 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive > slot request 7a4a0cfd16eec4a1cb043cce5f989db0 for job > a1221c743367497b0000000000000002 from resource manager with leader id > aaa9331f70b07a195b5f09d57d1b40c5. > 2023-07-06 10:23:48,002 INFO > org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Add job > a1221c743367497b0000000000000002 for job leader monitoring. > 2023-07-06 10:23:48,002 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive > slot request 92cbc64513fa703e4acf28bbb3088a58 for job > a1221c743367497b0000000000000002 from resource manager with leader id > aaa9331f70b07a195b5f09d57d1b40c5. 
> 2023-07-06 10:23:48,999 INFO > org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager > [] - Received resource requirements from job > a1221c743367497b0000000000000002: > [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, > numberOfRequiredSlots=4}] > 2023-07-06 10:23:47,998 INFO > org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - > Registered job manager > aaa9331f70b07a195b5f09d57d1b4...@akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_7 > for job a1221c743367497b0000000000000002. > 2023-07-06 10:23:47,953 INFO > org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - > Registering job manager > aaa9331f70b07a195b5f09d57d1b4...@akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_7 > for job a1221c743367497b0000000000000002. > 2023-07-06 10:23:47,922 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job > aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 > (a1221c743367497b0000000000000002) switched from state CREATED to RUNNING. > 2023-07-06 10:23:47,887 INFO > org.apache.flink.runtime.jobmaster.JobMaster [] - Starting > execution of job 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' > (a1221c743367497b0000000000000002) under job master id > aaa9331f70b07a195b5f09d57d1b40c5. > 2023-07-06 10:23:47,887 INFO > org.apache.flink.runtime.jobmaster.JobMaster [] - Using > failover strategy > org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@2222ba4d > for aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 > (a1221c743367497b0000000000000002). > 2023-07-06 10:23:47,880 INFO > org.apache.flink.runtime.jobmaster.JobMaster [] - Running > initialization on master for job > aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 > (a1221c743367497b0000000000000002). 
> 2023-07-06 10:23:47,872 INFO > org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - > Found 0 checkpoints in > KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-a1221c743367497b0000000000000002-config-map'}. > 2023-07-06 10:23:47,867 INFO > org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - > Recovering checkpoints from > KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-a1221c743367497b0000000000000002-config-map'}. > 2023-07-06 10:23:47,832 INFO > org.apache.flink.runtime.jobmaster.JobMaster [] - Using > restart back off time strategy > ExponentialDelayRestartBackoffTimeStrategy(initialBackoffMS=1000, > maxBackoffMS=300000, backoffMultiplier=2.0, resetBackoffThresholdMS=3600000, > jitterFactor=0.5, currentBackoffMS=1000, lastFailureTimestamp=0) for > aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615 > (a1221c743367497b0000000000000002). > 2023-07-06 10:23:47,832 INFO > org.apache.flink.runtime.jobmaster.JobMaster [] - > Initializing job 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' > (a1221c743367497b0000000000000002). > 2023-07-06 10:23:47,820 INFO > org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Added > JobGraph(jobId: a1221c743367497b0000000000000002) to > KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-cluster-config-map'}. > 2023-07-06 10:23:47,780 INFO > org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Submitting > job 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' > (a1221c743367497b0000000000000002). > 2023-07-06 10:23:47,776 INFO > org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Received > JobGraph submission 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615' > (a1221c743367497b0000000000000002). 
> 2023-07-06 10:23:47,668 INFO > org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] > - Submitting Job with JobId=a1221c743367497b0000000000000002. > 2023-07-06 10:23:47,668 INFO > org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] > - Job a1221c743367497b0000000000000002 is submitted. > Flink Operator > 2023-07-06 10:23:48,007 o.a.f.k.o.s.AbstractFlinkService [INFO > ][aelps-staging/aletsch-smc-staging-e5730831] Submitted job: > a1221c743367497b0000000000000002 to session cluster. > 2023-07-06 10:23:47,505 o.a.f.k.o.s.AbstractFlinkService [INFO > ][aelps-staging/aletsch-smc-staging-e5730831] Submitting job: > a1221c743367497b0000000000000002 to session cluster. > 2023-07-06 10:23:45,416 o.a.f.k.o.s.AbstractFlinkService [INFO > ][aelps-staging/aletsch-smc-staging-e5730831] Submitting job: > a1221c743367497b0000000000000002 to session cluster. > {code} > JobID: e692c2dfaa18441c0000000000000002 > {code:bash} > Flink Cluster > 2023-07-06 11:23:48,004 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed > checkpoint 1 for job e692c2dfaa18441c0000000000000002 (8194 bytes, > checkpointDuration=125 ms, finalizationTime=28 ms). > 2023-07-06 11:23:47,867 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering > checkpoint 1 (type=CheckpointType{name='Checkpoint', > sharingFilesStrategy=FORWARD_BACKWARD}) @ 1688635427851 for job > e692c2dfaa18441c0000000000000002. > 2023-07-06 10:23:48,568 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Offer > reserved slots to the leader of job e692c2dfaa18441c0000000000000002. > 2023-07-06 10:23:48,568 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Establish > JobManager connection for job e692c2dfaa18441c0000000000000002. 
> 2023-07-06 10:23:48,568 INFO > org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Successful > registration at job manager > akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_6 for job > e692c2dfaa18441c0000000000000002. > 2023-07-06 10:23:48,002 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive > slot request 5e5a0e55fac280bf31abf29a20bce684 for job > e692c2dfaa18441c0000000000000002 from resource manager with leader id > aaa9331f70b07a195b5f09d57d1b40c5. > 2023-07-06 10:23:48,002 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive > slot request 1cdbce54f4376a1df86430f97dab6858 for job > e692c2dfaa18441c0000000000000002 from resource manager with leader id > aaa9331f70b07a195b5f09d57d1b40c5. > 2023-07-06 10:23:48,002 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive > slot request 352db7288d0e4d1775d5f52dd14c769d for job > e692c2dfaa18441c0000000000000002 from resource manager with leader id > aaa9331f70b07a195b5f09d57d1b40c5. > 2023-07-06 10:23:48,001 INFO > org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Add job > e692c2dfaa18441c0000000000000002 for job leader monitoring. > 2023-07-06 10:23:48,000 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive > slot request bffed3e4a4c8573049a4119bd7e15f19 for job > e692c2dfaa18441c0000000000000002 from resource manager with leader id > aaa9331f70b07a195b5f09d57d1b40c5. 
> 2023-07-06 10:23:48,998 INFO > org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager > [] - Received resource requirements from job > e692c2dfaa18441c0000000000000002: > [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, > numberOfRequiredSlots=4}] > 2023-07-06 10:23:47,998 INFO > org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - > Registered job manager > aaa9331f70b07a195b5f09d57d1b4...@akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_6 > for job e692c2dfaa18441c0000000000000002. > 2023-07-06 10:23:47,953 INFO > org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - > Registering job manager > aaa9331f70b07a195b5f09d57d1b4...@akka.tcp://flink@10.0.11.158:6123/user/rpc/jobmanager_6 > for job e692c2dfaa18441c0000000000000002. > 2023-07-06 10:23:47,851 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job > aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615 > (e692c2dfaa18441c0000000000000002) switched from state CREATED to RUNNING. > 2023-07-06 10:23:47,845 INFO > org.apache.flink.runtime.jobmaster.JobMaster [] - Starting > execution of job 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615' > (e692c2dfaa18441c0000000000000002) under job master id > aaa9331f70b07a195b5f09d57d1b40c5. > 2023-07-06 10:23:47,844 INFO > org.apache.flink.runtime.jobmaster.JobMaster [] - Using > failover strategy > org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@7eeab246 > for aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615 > (e692c2dfaa18441c0000000000000002). > 2023-07-06 10:23:47,834 INFO > org.apache.flink.runtime.jobmaster.JobMaster [] - Running > initialization on master for job > aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615 > (e692c2dfaa18441c0000000000000002). 
> 2023-07-06 10:23:47,825 INFO > org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - > Found 0 checkpoints in > KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-e692c2dfaa18441c0000000000000002-config-map'}. > 2023-07-06 10:23:47,813 INFO > org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - > Recovering checkpoints from > KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-e692c2dfaa18441c0000000000000002-config-map'}. > 2023-07-06 10:23:47,782 INFO > org.apache.flink.runtime.jobmaster.JobMaster [] - Using > restart back off time strategy > ExponentialDelayRestartBackoffTimeStrategy(initialBackoffMS=1000, > maxBackoffMS=300000, backoffMultiplier=2.0, resetBackoffThresholdMS=3600000, > jitterFactor=0.5, currentBackoffMS=1000, lastFailureTimestamp=0) for > aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615 > (e692c2dfaa18441c0000000000000002). > 2023-07-06 10:23:47,781 INFO > org.apache.flink.runtime.jobmaster.JobMaster [] - > Initializing job 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615' > (e692c2dfaa18441c0000000000000002). > 2023-07-06 10:23:47,774 INFO > org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Added > JobGraph(jobId: e692c2dfaa18441c0000000000000002) to > KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-cluster-config-map'}. > 2023-07-06 10:23:47,703 INFO > org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Submitting > job 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615' > (e692c2dfaa18441c0000000000000002). > 2023-07-06 10:23:47,702 INFO > org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Received > JobGraph submission > 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615' > (e692c2dfaa18441c0000000000000002). 
> 2023-07-06 10:23:47,650 INFO > org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] > - Submitting Job with JobId=e692c2dfaa18441c0000000000000002. > 2023-07-06 10:23:47,650 INFO > org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] > - Job e692c2dfaa18441c0000000000000002 is submitted. > Flink Operator > 2023-07-06 10:23:47,973 o.a.f.k.o.s.AbstractFlinkService [INFO > ][aelps-staging/aletsch-wp-wafer-staging-e5730831] Submitted job: > e692c2dfaa18441c0000000000000002 to session cluster. > 2023-07-06 10:23:47,505 o.a.f.k.o.s.AbstractFlinkService [INFO > ][aelps-staging/aletsch-wp-wafer-staging-e5730831] Submitting job: > e692c2dfaa18441c0000000000000002 to session cluster. > 2023-07-06 10:23:45,374 o.a.f.k.o.s.AbstractFlinkService [INFO > ][aelps-staging/aletsch-wp-wafer-staging-e5730831] Submitting job: > e692c2dfaa18441c0000000000000002 to session cluster. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
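
The Details section of the quoted report argues that the crash-looping job ID belongs to the aletsch_mat resource because it matches that resource's uid. In the three examples given there, each jobId starts with the first 16 hex digits of the corresponding uid. The sketch below applies that observation to find which resource a job ID points to. It is an assumption drawn only from those three examples, not a statement about how the operator derives job IDs, and the function names are hypothetical.

```python
import uuid


def uid_prefix(uid):
    """Return the first 16 hex digits of a Kubernetes uid (dashes removed)."""
    return uuid.UUID(uid).hex[:16]


def resources_matching(job_id, uids_by_resource):
    """Return the resource names whose uid prefix equals the job ID's first 16 hex digits.

    Assumption (inferred from the report's examples only): a session job's ID
    begins with the first 16 hex digits of its FlinkSessionJob uid.
    """
    return [
        name
        for name, uid in uids_by_resource.items()
        if uid_prefix(uid) == job_id[:16]
    ]


# uid values as listed in the report's Details section.
UIDS = {
    "aletsch_smc": "a1221c74-3367-497b-ad2f-8793ab23919d",
    "aletsch_mat": "a7d36f38-81f9-43a0-898f-19b950430e9d",
    "aletsch_wp_wafer": "e692c2df-aa18-441c-a352-88aefa9a3017",
}

if __name__ == "__main__":
    # The ID under which the dashboard shows aletsch_smc restarting.
    print(resources_matching("a7d36f3881f943a00000000000000002", UIDS))
```

For the dashboard ID of the restarting aletsch_smc job, the lookup returns aletsch_mat, which is the mismatch the report describes.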