Thanks for the response on this issue. With the same configuration defined:

    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk-noa-edge-infra:2181
    high-availability.zookeeper.path.root: /flink
    high-availability.storageDir: /flink_state
    high-availability.jobmanager.port: 6124

For the storageDir, we are using a k8s persistent volume with ReadWriteOnce.

Currently we are facing a new issue. The task managers are able to register with the job manager via the address xxxx/xxxx:6124, but when trying to run tasks we get an error from a BlobClient:

java.io.IOException: Failed to fetch BLOB e78e9574da4f5e4bdbc8de9678ebfb36/p-650534cd619de1069630141f1dcc9876d6ce2ce0-ee11ae52caa20ff81909708a783fd596 from xxxx/xxxx:xxxx and store it under /hadoop/yarn/local/usercache/hdfs/appcache/application_1570784539965_0165/blobStore-79420f3a-6a83-40d4-8058-f01686a1ced8/incoming/temp-00000072
    at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:169)
    at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
    at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:202)
    at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
    at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:915)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:595)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Could not connect to BlobServer at address xxxx/xxxx:6124
    at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:100)
    at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:143)
    ... 7 more
Caused by: java.net.SocketException: xxxx
    at java.net.Socket.createImpl(Socket.java:460)
    at java.net.Socket.connect(Socket.java:587)
    at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:95)

We see these errors both on the task manager and on the job manager. Regarding the job manager: does it also access itself via some external service in order to fetch the blob?

Thanks
Sigalit

On Thu, Feb 17, 2022 at 3:51 PM Chesnay Schepler <ches...@apache.org> wrote:
> You could reduce the log level of the PermanentBlobCache to WARN via the Log4j configuration.
> I think you could even filter this specific message with Log4j.
>
> On 17/02/2022 14:35, Koffman, Noa (Nokia - IL/Kfar Sava) wrote:
>> Thanks,
>> I understand that the functionality isn't affected, this is very good news.
>> But is there a way to either skip this check or skip logging it? We see it in our log more than 400 times per task manager.
>> It would be very helpful if the log level could be reduced, or the check could be skipped. Is there any way to achieve this?
>>
>> Thanks
>> Noa
>>
>> From: Chesnay Schepler <ches...@apache.org>
>> Date: Thursday, 17 February 2022 at 15:00
>> To: Koffman, Noa (Nokia - IL/Kfar Sava) <noa.koff...@nokia.com>, Yun Gao <yungao...@aliyun.com>, user <user@flink.apache.org>
>> Subject: Re: Task manager errors with Flink ZooKeeper High Availability
>>
>> Everything is fine.
>>
>> The TM tries to retrieve the jar (aka, the blob), and there is a fast path to access it directly from storage. This fails (because it has no access to it), and then falls back to retrieving it from the JM.
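As a concrete illustration of the Log4j override Chesnay suggests above: a minimal sketch, assuming Flink's default Log4j 2 properties format (the logger label "blobcache" is an arbitrary name chosen here). This only changes what gets logged; the fallback to the BLOB server still happens.

    # Raise the PermanentBlobCache logger to WARN so the INFO-level
    # "Failed to copy from blob store" messages are no longer logged.
    logger.blobcache.name = org.apache.flink.runtime.blob.PermanentBlobCache
    logger.blobcache.level = WARN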
>>
>> On 17/02/2022 13:49, Koffman, Noa (Nokia - IL/Kfar Sava) wrote:
>>> Hi,
>>> Thanks for your reply.
>>> Please see below the full stack trace, and the log message right after; it looks like it is trying to download via BlobClient after failing to download from the blob store, as you suggested.
>>> My question is: is there a way to avoid this attempt to copy from the blob store? Is my configuration of the task manager wrong?
>>> Currently we are using the same flink-conf.yaml file for both the job manager and the task managers, which includes the high-availability configuration mentioned below. Should these settings be removed from the task managers?
>>>
>>> 2022-02-17 07:19:45,408 INFO  org.apache.flink.runtime.blob.PermanentBlobCache [] - Failed to copy from blob store. Downloading from BLOB server instead.
>>> java.io.FileNotFoundException: /flink_state/default/blob/job_0ddba6dd21053567981e11bda8da7c8e/blob_p-4ff16c9e641ba8803e55d62a6ab2f6d05512373e-92d38bfb5719024adf4c72b086184d76 (No such file or directory)
>>>     at java.io.FileInputStream.open0(Native Method) ~[?:?]
>>>     at java.io.FileInputStream.open(Unknown Source) ~[?:?]
>>>     at java.io.FileInputStream.<init>(Unknown Source) ~[?:?]
>>>     at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
>>>     at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:134) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
>>>     at org.apache.flink.runtime.blob.FileSystemBlobStore.get(FileSystemBlobStore.java:106) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
>>>     at org.apache.flink.runtime.blob.FileSystemBlobStore.get(FileSystemBlobStore.java:88) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
>>>     at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:145) [flink-dist_2.11-1.13.5.jar:1.13.5]
>>>     at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:187) [flink-dist_2.11-1.13.5.jar:1.13.5]
>>>     at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.createUserCodeClassLoader(BlobLibraryCacheManager.java:251) [flink-dist_2.11-1.13.5.jar:1.13.5]
>>>     at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.getOrResolveClassLoader(BlobLibraryCacheManager.java:228) [flink-dist_2.11-1.13.5.jar:1.13.5]
>>>     at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.access$1100(BlobLibraryCacheManager.java:199) [flink-dist_2.11-1.13.5.jar:1.13.5]
>>>     at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$DefaultClassLoaderLease.getOrResolveClassLoader(BlobLibraryCacheManager.java:333) [flink-dist_2.11-1.13.5.jar:1.13.5]
>>>     at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:1024) [flink-dist_2.11-1.13.5.jar:1.13.5]
>>>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:628) [flink-dist_2.11-1.13.5.jar:1.13.5]
>>>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566) [flink-dist_2.11-1.13.5.jar:1.13.5]
>>>     at java.lang.Thread.run(Unknown Source) [?:?]
>>> 2022-02-17 07:19:45,408 INFO  org.apache.flink.runtime.blob.BlobClient [] - Downloading 0ddba6dd21053567981e11bda8da7c8e/p-4ff16c9e641ba8803e55d62a6ab2f6d05512373e-92d38bfb5719024adf4c72b086184d76 from 10-1-10-213.noa-edge-infra-flink-jobmanager.noa-edge.svc.cluster.local/10.1.10.213:6124
>>>
>>> Thanks
>>> Noa
>>>
>>> From: Yun Gao <yungao...@aliyun.com>
>>> Date: Thursday, 17 February 2022 at 14:04
>>> To: Koffman, Noa (Nokia - IL/Kfar Sava) <noa.koff...@nokia.com>, user <user@flink.apache.org>
>>> Subject: Re: Task manager errors with Flink ZooKeeper High Availability
>>>
>>> Hi Koffman,
>>>
>>> From the TM side, the only possible usage that comes to mind is components like the BlobCache, which is used to transfer jars or large task information between the JM and TM. But specifically for the BlobService, if it fails to find the file it turns to the JM instead. If convenient, could you also post the stack trace of the exception, and may I have a double confirmation of whether the job is still running normally despite this exception?
>>>
>>> Sorry if I am missing something~
>>>
>>> Best,
>>> Yun
>>>
>>> ------------------ Original Mail ------------------
>>> Sender: Koffman, Noa (Nokia - IL/Kfar Sava) <noa.koff...@nokia.com>
>>> Send Date: Thu Feb 17 05:00:42 2022
>>> Recipients: user <user@flink.apache.org>
>>> Subject: Task manager errors with Flink ZooKeeper High Availability
>>>
>>> Hi,
>>> We are currently running Flink in a session deployment on a k8s cluster, with 1 job-manager and 3 task-managers.
>>> To support recovery from job-manager failure, following a different mail thread, we have enabled ZooKeeper high availability using a k8s Persistent Volume.
>>> To achieve this, we've added these conf values:
>>>
>>>     high-availability: zookeeper
>>>     high-availability.zookeeper.quorum: zk-noa-edge-infra:2181
>>>     high-availability.zookeeper.path.root: /flink
>>>     high-availability.storageDir: /flink_state
>>>     high-availability.jobmanager.port: 6150
>>>
>>> For the storageDir, we are using a k8s persistent volume with ReadWriteOnce.
>>>
>>> Recovery from job-manager failure is working now, but it looks like there are issues with the task-managers:
>>> The same configuration file is used in the task-managers as well, and there are a lot of errors in the task-managers' logs:
>>>
>>>     java.io.FileNotFoundException: /flink_state/flink/blob/job_9f4be579c7ab79817e25ed56762b7623/blob_p-5cf39313e388d9120c235528672fd267105be0e0-938e4347a98aa6166dc2625926fdab56 (No such file or directory)
>>>
>>> It seems that the task-managers are trying to access the job-manager's storage dir – can this be avoided?
>>> The task manager does not have access to the job manager persistent volume – is this mandatory?
>>> If we don't have the option to use shared storage, is there a way to make ZooKeeper hold and manage the job state, instead of using the shared storage?
>>>
>>> Thanks
>>> Noa
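For reference, a minimal flink-conf.yaml sketch of the HA settings described in this thread (values copied from the original message above; the storageDir path is the mount point of the ReadWriteOnce persistent volume):

    # ZooKeeper-based high availability, shared by the job-manager and task-managers
    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk-noa-edge-infra:2181
    high-availability.zookeeper.path.root: /flink
    # File system path backed by the k8s persistent volume (ReadWriteOnce)
    high-availability.storageDir: /flink_state
    high-availability.jobmanager.port: 6150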