Re: Task manager errors with Flink ZooKeeper High Availability

Chesnay Schepler Thu, 17 Feb 2022 05:00:53 -0800

Everything is fine.

The TM tries to retrieve the jar (aka, the blob), and there is a fastpath to access it directly from storage. This fails (because it has noaccess to it), and then falls back to retrieving it from the JM.


On 17/02/2022 13:49, Koffman, Noa (Nokia - IL/Kfar Sava) wrote:

Hi,

Thanks for your reply,
Please see below the full stack trace, and the log message rightafter, it looks like it is trying to download via BlobClient afterfailing to download from store, as you have suggested.
My question is, is there a way to avoid this attempt to copy from blobstore? Is my configuration of task manager wrong?
Currently we are using the same flink-conf.yaml file for both jobmanager and task managers, which include the high-availabilityconfiguration mentioned below, should these be remove from the taskmanagers?
/2022-02-17 07:19:45,408 INFOorg.apache.flink.runtime.blob.PermanentBlobCache [] - Failed to copyfrom blob store. Downloading from BLOB server instead./
/java.io.FileNotFoundException:/flink_state/default/blob/job_0ddba6dd21053567981e11bda8da7c8e/blob_p-4ff16c9e641ba8803e55d62a6ab2f6d05512373e-92d38bfb5719024adf4c72b086184d76(No such file or directory)/
/at java.io.FileInputStream.open0(Native Method) ~[?:?]/

/at java.io.FileInputStream.open(Unknown Source) ~[?:?]/

/at java.io.FileInputStream.<init>(Unknown Source) ~[?:?]/
/atorg.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)~[flink-dist_2.11-1.13.5.jar:1.13.5]/
/atorg.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:134)~[flink-dist_2.11-1.13.5.jar:1.13.5]/
/atorg.apache.flink.runtime.blob.FileSystemBlobStore.get(FileSystemBlobStore.java:106)~[flink-dist_2.11-1.13.5.jar:1.13.5]/
/atorg.apache.flink.runtime.blob.FileSystemBlobStore.get(FileSystemBlobStore.java:88)~[flink-dist_2.11-1.13.5.jar:1.13.5]/
/atorg.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:145)[flink-dist_2.11-1.13.5.jar:1.13.5]/
/atorg.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:187)[flink-dist_2.11-1.13.5.jar:1.13.5]/
/atorg.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.createUserCodeClassLoader(BlobLibraryCacheManager.java:251)[flink-dist_2.11-1.13.5.jar:1.13.5]/
/atorg.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.getOrResolveClassLoader(BlobLibraryCacheManager.java:228)[flink-dist_2.11-1.13.5.jar:1.13.5]/
/atorg.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.access$1100(BlobLibraryCacheManager.java:199)[flink-dist_2.11-1.13.5.jar:1.13.5]/
/atorg.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$DefaultClassLoaderLease.getOrResolveClassLoader(BlobLibraryCacheManager.java:333)[flink-dist_2.11-1.13.5.jar:1.13.5]/
/atorg.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:1024)[flink-dist_2.11-1.13.5.jar:1.13.5]/
/at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:628)[flink-dist_2.11-1.13.5.jar:1.13.5]/
/at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)[flink-dist_2.11-1.13.5.jar:1.13.5]/
/at java.lang.Thread.run(Unknown Source) [?:?]/
/2022-02-17 07:19:45,408 INFO org.apache.flink.runtime.blob.BlobClient [] - Downloading0ddba6dd21053567981e11bda8da7c8e/p-4ff16c9e641ba8803e55d62a6ab2f6d05512373e-92d38bfb5719024adf4c72b086184d76from10-1-10-213.noa-edge-infra-flink-jobmanager.noa-edge.svc.cluster.local/10.1.10.213:6124/
//

Thanks

Noa

*From: *Yun Gao <yungao...@aliyun.com>
*Date: *Thursday, 17 February 2022 at 14:04
*To: *Koffman, Noa (Nokia - IL/Kfar Sava) <noa.koff...@nokia.com>,user <user@flink.apache.org>
*Subject: *Re: Task manager errors with Flink ZooKeeper High Availability

Hi Koffman,
From TM side the only possible usage come to me is that or componentslike BlobCache, which is used to
transfer jars or large task informations between JM and TM. Butspecially for BlobService, if it failed to find
the file it would turn to JM via http connection. If convenient couldyou also post the stack of the exception
and may I have a double confirmation whether the job could stillrunning normally with this exception?
Sorry that I might miss something~

Best,

Yun

    ------------------Original Mail ------------------

    *Sender:*Koffman, Noa (Nokia - IL/Kfar Sava) <noa.koff...@nokia.com>

    *Send Date:*Thu Feb 17 05:00:42 2022

    *Recipients:*user <user@flink.apache.org>

    *Subject:*Task manager errors with Flink ZooKeeper High Availability

        Hi,

        We are currently running flink in session deployment on k8s
        cluster, with 1 job-manager and 3 task-managers

        To support recovery from job-manager failure, following a
        different mail thread,

        We have enabled zookeeper high availability using a k8s
        Persistent Volume

        To achieve this, we’ve added these conf values:

        /    high-availability: zookeeper/

        /high-availability.zookeeper.quorum: zk-noa-edge-infra:2181/

        /high-availability.zookeeper.path.root: /flink/

        /high-availability.storageDir: /flink_state/

        /high-availability.jobmanager.port: 6150/

        for the storageDir, we are using a k8s persistent volume with
        ReadWriteOnce

        Recovery of job-manager failure is working now, but it looks
        like there are issues with the task-managers:

        The same configuration file is used in the task-managers as
        well, and there are a lot of error in the task-manager’s logs –

        java.io.FileNotFoundException:
        
/flink_state/flink/blob/job_9f4be579c7ab79817e25ed56762b7623/blob_p-5cf39313e388d9120c235528672fd267105be0e0-938e4347a98aa6166dc2625926fdab56
        (No such file or directory)

        It seems that the task-managers are trying to access the
        job-manager’s storage dir – can this be avoided?

        The task manager does not have access to the job manager
        persistent volume – is this mandatory?

        If we don’t have the option to use shared storage, is there a
        way to make zookeeper hold and manage the job states, instead
        of using the shared storage?

        Thanks

        Noa

Re: Task manager errors with Flink ZooKeeper High Availability

Reply via email to