Hello folks, I'm seeing application failures where our Blobserver is refusing connections mid application:
2020-09-30 13:56:06,227 INFO [flink-akka.actor.default-dispatcher-18] org.apache.flink.runtime.taskexecutor.TaskExecutor - Un-registering task and sending final execution state FINISHED to JobManager for task DataSink (TextOutputFormat (hdfs:/user/p2epda/lake/delp_prod/PROD/APPROVED/staging/datastore/MandateTradingLine/tmp_pipeline_collapse) - UTF-8) 3d1890b47f4398d805cf0c1b54286f71. 2020-09-30 13:56:06,423 INFO [flink-akka.actor.default-dispatcher-18] org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable - Free slot TaskSlot(index:0, state:ACTIVE, resource profile: ResourceProfile{cpuCores=1.7976931348623157E308, heapMemoryInMB=2147483647, directMemoryInMB=2147483647, nativeMemoryInMB=2147483647, networkMemoryInMB=2147483647, managedMemoryInMB=3046}, allocationId: e8be16ec74f16c795d95b89cd08f5c37, jobId: e808de0373bd515224434b7ec1efe249). 2020-09-30 13:56:06,424 INFO [flink-akka.actor.default-dispatcher-18] org.apache.flink.runtime.taskexecutor.JobLeaderService - Remove job e808de0373bd515224434b7ec1efe249 from job leader monitoring. 2020-09-30 13:56:06,424 INFO [flink-akka.actor.default-dispatcher-18] org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager connection for job e808de0373bd515224434b7ec1efe249. 2020-09-30 13:56:06,426 INFO [flink-akka.actor.default-dispatcher-18] org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager connection for job e808de0373bd515224434b7ec1efe249. 2020-09-30 13:56:06,426 INFO [flink-akka.actor.default-dispatcher-18] org.apache.flink.runtime.taskexecutor.JobLeaderService - Cannot reconnect to job e808de0373bd515224434b7ec1efe249 because it is not registered. 2020-09-30 13:56:09,918 INFO [CHAIN DataSource (dataset | Read Staging From File System | AVRO) -> Map (Map at readAvroFileWithFilter(FlinkReadUtils.java:82)) -> Filter (Filter at validateData(DAXTask.java:97)) -> FlatMap (FlatMap at handleBloomFilter(PreMergeTask.java:187)) -> FlatMap (FlatMap at collapsePipelineIfRequired(Task.java:160)) (1/1)] org.apache.flink.runtime.blob.BlobClient - Downloading 48b8ba9f3de039f74483085f90e5ad64/p-0b27cb203799adb76d2a434ed3d64052d832cff3-6915d0cd0fef97f728cd890986b2bf39 from d43723-430.dc.gs.com/10.48.128.14:46473 (retry 3) 2020-09-30 13:56:09,920 ERROR [CHAIN DataSource (dataset | Read Staging From File System | AVRO) -> Map (Map at readAvroFileWithFilter(FlinkReadUtils.java:82)) -> Filter (Filter at validateData(DAXTask.java:97)) -> FlatMap (FlatMap at handleBloomFilter(PreMergeTask.java:187)) -> FlatMap (FlatMap at collapsePipelineIfRequired(Task.java:160)) (1/1)] org.apache.flink.runtime.blob.BlobClient - Failed to fetch BLOB 48b8ba9f3de039f74483085f90e5ad64/p-0b27cb203799adb76d2a434ed3d64052d832cff3-6915d0cd0fef97f728cd890986b2bf39 from d43723-430.dc.gs.com/10.48.128.14:46473 and store it under /fs/htmp/yarn/local/usercache/delp_prod/appcache/application_1599723434208_15328880/blobStore-e2888df1-c7be-4ce6-b6b6-58e7c24a140a/incoming/temp-00000004 Retrying... 2020-09-30 13:56:09,920 INFO [CHAIN DataSource (dataset | Read Staging From File System | AVRO) -> Map (Map at readAvroFileWithFilter(FlinkReadUtils.java:82)) -> Filter (Filter at validateData(DAXTask.java:97)) -> FlatMap (FlatMap at handleBloomFilter(PreMergeTask.java:187)) -> FlatMap (FlatMap at collapsePipelineIfRequired(Task.java:160)) (1/1)] org.apache.flink.runtime.blob.BlobClient - Downloading 48b8ba9f3de039f74483085f90e5ad64/p-0b27cb203799adb76d2a434ed3d64052d832cff3-6915d0cd0fef97f728cd890986b2bf39 from d43723-430.dc.gs.com/10.48.128.14:46473 (retry 4) 2020-09-30 13:56:09,922 ERROR [CHAIN DataSource (dataset | Read Staging From File System | AVRO) -> Map (Map at readAvroFileWithFilter(FlinkReadUtils.java:82)) -> Filter (Filter at validateData(DAXTask.java:97)) -> FlatMap (FlatMap at handleBloomFilter(PreMergeTask.java:187)) -> FlatMap (FlatMap at collapsePipelineIfRequired(Task.java:160)) (1/1)] org.apache.flink.runtime.blob.BlobClient - Failed to fetch BLOB 48b8ba9f3de039f74483085f90e5ad64/p-0b27cb203799adb76d2a434ed3d64052d832cff3-6915d0cd0fef97f728cd890986b2bf39 from d43723-430.dc.gs.com/10.48.128.14:46473 and store it under /fs/htmp/yarn/local/usercache/delp_prod/appcache/application_1599723434208_15328880/blobStore-e2888df1-c7be-4ce6-b6b6-58e7c24a140a/incoming/temp-00000004 Retrying... 2020-09-30 13:56:09,922 INFO [CHAIN DataSource (dataset | Read Staging From File System | AVRO) -> Map (Map at readAvroFileWithFilter(FlinkReadUtils.java:82)) -> Filter (Filter at validateData(DAXTask.java:97)) -> FlatMap (FlatMap at handleBloomFilter(PreMergeTask.java:187)) -> FlatMap (FlatMap at collapsePipelineIfRequired(Task.java:160)) (1/1)] org.apache.flink.runtime.blob.BlobClient - Downloading 48b8ba9f3de039f74483085f90e5ad64/p-0b27cb203799adb76d2a434ed3d64052d832cff3-6915d0cd0fef97f728cd890986b2bf39 from d43723-430.dc.gs.com/10.48.128.14:46473 (retry 5) 2020-09-30 13:56:09,925 ERROR [CHAIN DataSource (dataset | Read Staging From File System | AVRO) -> Map (Map at readAvroFileWithFilter(FlinkReadUtils.java:82)) -> Filter (Filter at validateData(DAXTask.java:97)) -> FlatMap (FlatMap at handleBloomFilter(PreMergeTask.java:187)) -> FlatMap (FlatMap at collapsePipelineIfRequired(Task.java:160)) (1/1)] org.apache.flink.runtime.blob.BlobClient - Failed to fetch BLOB 48b8ba9f3de039f74483085f90e5ad64/p-0b27cb203799adb76d2a434ed3d64052d832cff3-6915d0cd0fef97f728cd890986b2bf39 from d43723-430.dc.gs.com/10.48.128.14:46473 and store it under /fs/htmp/yarn/local/usercache/delp_prod/appcache/application_1599723434208_15328880/blobStore-e2888df1-c7be-4ce6-b6b6-58e7c24a140a/incoming/temp-00000004 No retries left. java.io.IOException: Could not connect to BlobServer at address d43723-430.dc.gs.com/10.48.128.14:46473 at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:100) at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:143) at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181) at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:202) at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120) at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:915) at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:595) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530) at java.lang.Thread.run(Thread.java:745) Caused by: java.net.ConnectException: Connection refused (Connection refused) at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:589) at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:95) ... 8 more Prior to the above connection refused error, I don't see any exceptions or failures. We're running this application with v1.9.2 on YARN, 26 Task Managers with 2 cores each, and with the default BLOB server configurations. The application itself then has many jobs it submits to the cluster. Does this sound like a blob.fetch.backlog/concurrent-connections config problem (defaulted to 1000 and 50 respectively)? I wasn't sure how chatty each TM is with the server. How can we tell if it's either a max concurrent-conn or backlog problem? Best, Andreas ________________________________ Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices<http://www.gs.com/privacy-notices>