Hello,

Some of our applications submit a high volume of jobs to Flink (we are using 
v1.9.2), and we are seeing many of those jobs (in this case 44 out of 350+) 
fail with the same FlinkJobNotFoundException within a 45-second window.

From our client logs, this is the exception we can see:


Calc Engine: Caused by: org.apache.flink.runtime.rest.util.RestClientException: 
[org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find 
Flink job (d0991f0ae712a9df710aa03311a32c8c)]

Calc Engine:   at 
org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:389)

Calc Engine:   at 
org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:373)

Calc Engine:   at 
java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)

Calc Engine:   at 
java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)

Calc Engine:   at 
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)

Calc Engine:   ... 3 more


This is the first job to fail with the above exception. The JobManager logs 
show the job reaching the FINISHED state, followed just over a second later by 
the same exception in the REST handler:

2021-09-28 04:54:16,936 INFO  [flink-akka.actor.default-dispatcher-28] 
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job Flink Java 
Job at Tue Sep 28 04:48:21 EDT 2021 (d0991f0ae712a9df710aa03311a32c8c) switched 
from state RUNNING to FINISHED.
2021-09-28 04:54:16,937 INFO  [flink-akka.actor.default-dispatcher-28] 
org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Job 
d0991f0ae712a9df710aa03311a32c8c reached globally terminal state FINISHED.
2021-09-28 04:54:16,939 INFO  [flink-akka.actor.default-dispatcher-28] 
org.apache.flink.runtime.jobmaster.JobMaster                  - Stopping the 
JobMaster for job Flink Java Job at Tue Sep 28 04:48:21 EDT 
2021(d0991f0ae712a9df710aa03311a32c8c).
2021-09-28 04:54:16,940 INFO  [flink-akka.actor.default-dispatcher-39] 
org.apache.flink.yarn.YarnResourceManager                     - Disconnect job 
manager 
00000000000000000000000000000...@akka.tcp://fl...@d43723-714.dc.gs.com:44887/user/jobmanager_392
 for job d0991f0ae712a9df710aa03311a32c8c from the resource manager.
2021-09-28 04:54:18,256 ERROR [flink-akka.actor.default-dispatcher-91] 
org.apache.flink.runtime.rest.handler.job.JobExecutionResultHandler  - 
Exception occurred in REST handler: 
org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find 
Flink job (d0991f0ae712a9df710aa03311a32c8c)
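
To make the timing concrete, the gap between the dispatcher reporting the 
terminal state and the REST handler error can be computed from the two 
timestamps in the excerpt above. This is only a minimal Python sketch: the 
helper name gap_seconds is made up for illustration, and it assumes nothing 
beyond the "yyyy-MM-dd HH:mm:ss,SSS" timestamp layout seen in these logs.

```python
from datetime import datetime

# Timestamp layout used in the log excerpts (comma before the milliseconds).
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def gap_seconds(earlier: str, later: str) -> float:
    """Return the number of seconds between two log timestamps."""
    start = datetime.strptime(earlier, LOG_TS_FORMAT)
    end = datetime.strptime(later, LOG_TS_FORMAT)
    return (end - start).total_seconds()


# Timestamps copied from the JobManager excerpt.
terminal_state = "2021-09-28 04:54:16,937"  # "reached globally terminal state FINISHED"
rest_error = "2021-09-28 04:54:18,256"      # "Exception occurred in REST handler"

gap = gap_seconds(terminal_state, rest_error)
print(f"REST error logged {gap:.3f}s after the job reached FINISHED")
```

For this job the gap is roughly 1.3 seconds, so the client was asking for the 
result of a job that the dispatcher had already reported as FINISHED.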


Here are the relevant logs from the TaskManager. We can see that the 
JobLeaderService tries to reconnect to the job after the job has already been 
removed from job leader monitoring. Any ideas as to why it is trying to 
reconnect?

2021-09-28 04:54:13,382 INFO  [flink-akka.actor.default-dispatcher-16] 
org.apache.flink.runtime.taskexecutor.TaskExecutor            - Receive slot 
request b26c04706fd5aad03dfdca8691f1bf1c for job 
d0991f0ae712a9df710aa03311a32c8c from resource manager with leader id 
00000000000000000000000000000000.

2021-09-28 04:54:13,383 INFO  [flink-akka.actor.default-dispatcher-16] 
org.apache.flink.runtime.taskexecutor.JobLeaderService        - Add job 
d0991f0ae712a9df710aa03311a32c8c for job leader monitoring.

2021-09-28 04:54:13,397 INFO  [flink-akka.actor.default-dispatcher-16] 
org.apache.flink.runtime.taskexecutor.JobLeaderService        - Successful 
registration at job manager 
akka.tcp://fl...@d43723-714.dc.gs.com:44887/user/jobmanager_392 for job 
d0991f0ae712a9df710aa03311a32c8c.

2021-09-28 04:54:13,397 INFO  [flink-akka.actor.default-dispatcher-16] 
org.apache.flink.runtime.taskexecutor.TaskExecutor            - Establish 
JobManager connection for job d0991f0ae712a9df710aa03311a32c8c.

2021-09-28 04:54:13,397 INFO  [flink-akka.actor.default-dispatcher-16] 
org.apache.flink.runtime.taskexecutor.TaskExecutor            - Offer reserved 
slots to the leader of job d0991f0ae712a9df710aa03311a32c8c.

2021-09-28 04:54:13,405 INFO  [CHAIN DataSource (settl_delivery_type_code | 
DistCp | Sourcing Files) -> FlatMap (settl_delivery_type_code | DistCp | 
Copying Batch Data) (1/1)] org.apache.flink.runtime.blob.BlobClient             
         - Downloading 
d0991f0ae712a9df710aa03311a32c8c/p-54dfdc41d9a995e5b75eb9eb29bcac91725fc425-be2bc5df9ef6ce331e8019cf32eb222b
 from d43723-714.dc.gs.com/10.175.239.171:38726

2021-09-28 04:54:16,942 INFO  [flink-akka.actor.default-dispatcher-3] 
org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable      - Free slot 
TaskSlot(index:0, state:ACTIVE, resource profile: 
ResourceProfile{cpuCores=1.7976931348623157E308, heapMemoryInMB=2147483647, 
directMemoryInMB=2147483647, nativeMemoryInMB=2147483647, 
networkMemoryInMB=2147483647, managedMemoryInMB=9318}, allocationId: 
b26c04706fd5aad03dfdca8691f1bf1c, jobId: d0991f0ae712a9df710aa03311a32c8c).

2021-09-28 04:54:16,942 INFO  [flink-akka.actor.default-dispatcher-3] 
org.apache.flink.runtime.taskexecutor.JobLeaderService        - Remove job 
d0991f0ae712a9df710aa03311a32c8c from job leader monitoring.

2021-09-28 04:54:16,942 INFO  [flink-akka.actor.default-dispatcher-3] 
org.apache.flink.runtime.taskexecutor.TaskExecutor            - Close 
JobManager connection for job d0991f0ae712a9df710aa03311a32c8c.

2021-09-28 04:54:16,945 INFO  [flink-akka.actor.default-dispatcher-3] 
org.apache.flink.runtime.taskexecutor.TaskExecutor            - Close 
JobManager connection for job d0991f0ae712a9df710aa03311a32c8c.

2021-09-28 04:54:16,945 INFO  [flink-akka.actor.default-dispatcher-3] 
org.apache.flink.runtime.taskexecutor.JobLeaderService        - Cannot 
reconnect to job d0991f0ae712a9df710aa03311a32c8c because it is not registered.
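
Because the excerpts in this mail are hard-wrapped, it can help to re-join 
each log record onto one line before comparing JobManager and TaskManager 
events for a single job. The following is a rough Python sketch, not part of 
our tooling: the function names unwrap_records and events_for_job are 
hypothetical, and it assumes every record starts with a timestamp in the 
layout shown above. Wrapped pieces are joined with a single space, which is an 
approximation where a long token was split.

```python
import re
from typing import List

# A record starts with a timestamp such as "2021-09-28 04:54:16,942 ".
RECORD_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} ")


def unwrap_records(text: str) -> List[str]:
    """Re-join log records that were wrapped across several lines."""
    records: List[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # blank separator lines carry no information
        if RECORD_START.match(line) or not records:
            records.append(line)          # a new record begins here
        else:
            records[-1] += " " + line     # continuation of the previous record
    return records


def events_for_job(text: str, job_id: str) -> List[str]:
    """Return the unwrapped records that mention the given job id."""
    return [record for record in unwrap_records(text) if job_id in record]


# Two records copied from the TaskManager excerpt.
SAMPLE = """\
2021-09-28 04:54:16,942 INFO  [flink-akka.actor.default-dispatcher-3]
org.apache.flink.runtime.taskexecutor.JobLeaderService        - Remove job
d0991f0ae712a9df710aa03311a32c8c from job leader monitoring.

2021-09-28 04:54:16,945 INFO  [flink-akka.actor.default-dispatcher-3]
org.apache.flink.runtime.taskexecutor.JobLeaderService        - Cannot
reconnect to job d0991f0ae712a9df710aa03311a32c8c because it is not registered.
"""

for event in events_for_job(SAMPLE, "d0991f0ae712a9df710aa03311a32c8c"):
    print(event)
```

Run over the two excerpts, this puts the "Remove job ... from job leader 
monitoring" record directly before the "Cannot reconnect" record, 3 ms apart.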



We have seen that increasing driver memory helps alleviate the issue, but 
certain high-volume applications still hit the same exception, even with 50G 
of driver memory. We would like to avoid increasing driver memory any further, 
if at all possible. We are unsure of the root cause, and any input would be 
greatly appreciated.

Please let me know if any more information is required.


Best,
Doug

_______________________________________________________________
Doug Gusick
Data Architecture
Goldman Sachs
30 Hudson St| Jersey City, 07302
Tel: +1 212 357 2640
email: doug.gus...@gs.com
_____________________________________________________________