Thanks for the response on this issue. With the same configuration defined:

    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk-noa-edge-infra:2181
    high-availability.zookeeper.path.root: /flink
    high-availability.storageDir: /flink_state
    high-availability.jobmanager.port: 6124

For the storageDir, we are using a k8s persistent volume with ReadWriteOnce.

Currently we are facing a new issue. The task managers are able to register with the job manager via the address xxxx/xxxx:6124, but when trying to run tasks we get an error from a BlobClient:

java.io.IOException: Failed to fetch BLOB e78e9574da4f5e4bdbc8de9678ebfb36/p-650534cd619de1069630141f1dcc9876d6ce2ce0-ee11ae52caa20ff81909708a783fd596 from xxxx/xxxx:xxxx and store it under /hadoop/yarn/local/usercache/hdfs/appcache/application_1570784539965_0165/blobStore-79420f3a-6a83-40d4-8058-f01686a1ced8/incoming/temp-00000072
    at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:169)
    at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
    at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:202)
    at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
    at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:915)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:595)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Could not connect to BlobServer at address xxxx/xxxx:6124
    at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:100)
    at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:143)
    ... 7 more
Caused by: java.net.SocketException: xxxx
    at java.net.Socket.createImpl(Socket.java:460)
    at java.net.Socket.connect(Socket.java:587)
    at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:95)

We see these errors both on the task manager and on the job manager. Regarding the job manager: does it also access itself via some external service in order to fetch the blob?

Thanks
Sigalit

On Thu, Feb 17, 2022 at 3:51 PM Chesnay Schepler <ches...@apache.org> wrote:
> You could reduce the log level of the PermanentBlobCache to WARN via the Log4j configuration.
> I think you could even filter this specific message with Log4j.
>
> On 17/02/2022 14:35, Koffman, Noa (Nokia - IL/Kfar Sava) wrote:
>> Thanks,
>> I understand that the functionality isn't affected, this is very good news.
>> But is there a way to either skip this check or skip logging it? We see it in our log more than 400 times per task manager.
>> It would be very helpful if the log level could be reduced, or the check could be skipped. Is there any way to achieve this?
>>
>> Thanks
>> Noa
>>
>> From: Chesnay Schepler <ches...@apache.org>
>> Date: Thursday, 17 February 2022 at 15:00
>> To: Koffman, Noa (Nokia - IL/Kfar Sava) <noa.koff...@nokia.com>, Yun Gao <yungao...@aliyun.com>, user <user@flink.apache.org>
>> Subject: Re: Task manager errors with Flink ZooKeeper High Availability
>>
>> Everything is fine.
>>
>> The TM tries to retrieve the jar (aka, the blob), and there is a fast path to access it directly from storage. This fails (because it has no access to it), and then falls back to retrieving it from the JM.
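As a concrete illustration of the Log4j override Chesnay suggests above: a minimal sketch, assuming Flink's default Log4j 2 properties format (the logger label "blobcache" is an arbitrary name chosen here). This only changes what gets logged; the fallback to the BLOB server still happens.

    # Raise the PermanentBlobCache logger to WARN so the INFO-level
    # "Failed to copy from blob store" messages are no longer logged.
    logger.blobcache.name = org.apache.flink.runtime.blob.PermanentBlobCache
    logger.blobcache.level = WARN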
>>
>> On 17/02/2022 13:49, Koffman, Noa (Nokia - IL/Kfar Sava) wrote:
>>> Hi,
>>> Thanks for your reply.
>>> Please see below the full stack trace, and the log message right after; it looks like it is trying to download via BlobClient after failing to download from the blob store, as you suggested.
>>> My question is: is there a way to avoid this attempt to copy from the blob store? Is my configuration of the task manager wrong?
>>> Currently we are using the same flink-conf.yaml file for both the job manager and the task managers, which includes the high-availability configuration mentioned below. Should these settings be removed from the task managers?
>>>
>>> 2022-02-17 07:19:45,408 INFO  org.apache.flink.runtime.blob.PermanentBlobCache [] - Failed to copy from blob store. Downloading from BLOB server instead.
>>> java.io.FileNotFoundException: /flink_state/default/blob/job_0ddba6dd21053567981e11bda8da7c8e/blob_p-4ff16c9e641ba8803e55d62a6ab2f6d05512373e-92d38bfb5719024adf4c72b086184d76 (No such file or directory)
>>>     at java.io.FileInputStream.open0(Native Method) ~[?:?]
>>>     at java.io.FileInputStream.open(Unknown Source) ~[?:?]
>>>     at java.io.FileInputStream.<init>(Unknown Source) ~[?:?]
>>>     at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
>>>     at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:134) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
>>>     at org.apache.flink.runtime.blob.FileSystemBlobStore.get(FileSystemBlobStore.java:106) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
>>>     at org.apache.flink.runtime.blob.FileSystemBlobStore.get(FileSystemBlobStore.java:88) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
>>>     at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:145) [flink-dist_2.11-1.13.5.jar:1.13.5]
>>>     at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:187) [flink-dist_2.11-1.13.5.jar:1.13.5]
>>>     at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.createUserCodeClassLoader(BlobLibraryCacheManager.java:251) [flink-dist_2.11-1.13.5.jar:1.13.5]
>>>     at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.getOrResolveClassLoader(BlobLibraryCacheManager.java:228) [flink-dist_2.11-1.13.5.jar:1.13.5]
>>>     at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.access$1100(BlobLibraryCacheManager.java:199) [flink-dist_2.11-1.13.5.jar:1.13.5]
>>>     at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$DefaultClassLoaderLease.getOrResolveClassLoader(BlobLibraryCacheManager.java:333) [flink-dist_2.11-1.13.5.jar:1.13.5]
>>>     at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:1024) [flink-dist_2.11-1.13.5.jar:1.13.5]
>>>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:628) [flink-dist_2.11-1.13.5.jar:1.13.5]
>>>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566) [flink-dist_2.11-1.13.5.jar:1.13.5]
>>>     at java.lang.Thread.run(Unknown Source) [?:?]
>>> 2022-02-17 07:19:45,408 INFO  org.apache.flink.runtime.blob.BlobClient [] - Downloading 0ddba6dd21053567981e11bda8da7c8e/p-4ff16c9e641ba8803e55d62a6ab2f6d05512373e-92d38bfb5719024adf4c72b086184d76 from 10-1-10-213.noa-edge-infra-flink-jobmanager.noa-edge.svc.cluster.local/10.1.10.213:6124
>>>
>>> Thanks
>>> Noa
>>>
>>> From: Yun Gao <yungao...@aliyun.com>
>>> Date: Thursday, 17 February 2022 at 14:04
>>> To: Koffman, Noa (Nokia - IL/Kfar Sava) <noa.koff...@nokia.com>, user <user@flink.apache.org>
>>> Subject: Re: Task manager errors with Flink ZooKeeper High Availability
>>>
>>> Hi Koffman,
>>>
>>> From the TM side, the only possible usage that comes to mind is components like the BlobCache, which is used to transfer jars or large task information between the JM and TM. But specifically for the BlobService, if it fails to find the file it turns to the JM instead. If convenient, could you also post the stack trace of the exception, and may I have a double confirmation of whether the job is still running normally despite this exception?
>>>
>>> Sorry if I am missing something~
>>>
>>> Best,
>>> Yun
>>>
>>> ------------------ Original Mail ------------------
>>> Sender: Koffman, Noa (Nokia - IL/Kfar Sava) <noa.koff...@nokia.com>
>>> Send Date: Thu Feb 17 05:00:42 2022
>>> Recipients: user <user@flink.apache.org>
>>> Subject: Task manager errors with Flink ZooKeeper High Availability
>>>
>>> Hi,
>>> We are currently running Flink in a session deployment on a k8s cluster, with 1 job-manager and 3 task-managers.
>>> To support recovery from job-manager failure, following a different mail thread, we have enabled ZooKeeper high availability using a k8s Persistent Volume.
>>> To achieve this, we've added these conf values:
>>>
>>>     high-availability: zookeeper
>>>     high-availability.zookeeper.quorum: zk-noa-edge-infra:2181
>>>     high-availability.zookeeper.path.root: /flink
>>>     high-availability.storageDir: /flink_state
>>>     high-availability.jobmanager.port: 6150
>>>
>>> For the storageDir, we are using a k8s persistent volume with ReadWriteOnce.
>>>
>>> Recovery from job-manager failure is working now, but it looks like there are issues with the task-managers:
>>> The same configuration file is used in the task-managers as well, and there are a lot of errors in the task-managers' logs:
>>>
>>>     java.io.FileNotFoundException: /flink_state/flink/blob/job_9f4be579c7ab79817e25ed56762b7623/blob_p-5cf39313e388d9120c235528672fd267105be0e0-938e4347a98aa6166dc2625926fdab56 (No such file or directory)
>>>
>>> It seems that the task-managers are trying to access the job-manager's storage dir – can this be avoided?
>>> The task manager does not have access to the job manager persistent volume – is this mandatory?
>>> If we don't have the option to use shared storage, is there a way to make ZooKeeper hold and manage the job state, instead of using the shared storage?
>>>
>>> Thanks
>>> Noa
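For reference, a minimal flink-conf.yaml sketch of the HA settings described in this thread (values copied from the original message above; the storageDir path is the mount point of the ReadWriteOnce persistent volume):

    # ZooKeeper-based high availability, shared by the job-manager and task-managers
    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk-noa-edge-infra:2181
    high-availability.zookeeper.path.root: /flink
    # File system path backed by the k8s persistent volume (ReadWriteOnce)
    high-availability.storageDir: /flink_state
    high-availability.jobmanager.port: 6150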