Re: Task manager errors with Flink ZooKeeper High Availability

Chesnay Schepler Thu, 17 Feb 2022 05:51:22 -0800

You could reduce the log level of the PermanentBlobCache to WARN via theLog4j configuration.

I think you could even filter this specific message with Log4j.


On 17/02/2022 14:35, Koffman, Noa (Nokia - IL/Kfar Sava) wrote:


Thanks,

I understand that the functionality isn’t affected, this is very goodnews.

But is there a way to either skip this check or skip logging it? Wesee it in our log more the 400 times per task manager.

It would be very helpful if the log level could be reduced, or thecheck could be skipped? Is there any way to achieve this?


Thanks

Noa

*From: *Chesnay Schepler <ches...@apache.org>
*Date: *Thursday, 17 February 2022 at 15:00

*To: *Koffman, Noa (Nokia - IL/Kfar Sava) <noa.koff...@nokia.com>, YunGao <yungao...@aliyun.com>, user <user@flink.apache.org>

*Subject: *Re: Task manager errors with Flink ZooKeeper High Availability

Everything is fine.

The TM tries to retrieve the jar (aka, the blob), and there is a fastpath to access it directly from storage. This fails (because it has noaccess to it), and then falls back to retrieving it from the JM.


On 17/02/2022 13:49, Koffman, Noa (Nokia - IL/Kfar Sava) wrote:

    Hi,

    Thanks for your reply,

    Please see below the full stack trace, and the log message right
    after, it looks like it is trying to download via BlobClient after
    failing to download from store, as you have suggested.

    My question is, is there a way to avoid this attempt to copy from
    blob store? Is my configuration of task manager wrong?

    Currently we are using the same flink-conf.yaml file for both job
    manager and task managers, which include the high-availability
    configuration mentioned below, should these be remove from the
    task managers?

    /2022-02-17 07:19:45,408 INFO
    org.apache.flink.runtime.blob.PermanentBlobCache [] - Failed to
    copy from blob store. Downloading from BLOB server instead./

    /java.io.FileNotFoundException:
    
/flink_state/default/blob/job_0ddba6dd21053567981e11bda8da7c8e/blob_p-4ff16c9e641ba8803e55d62a6ab2f6d05512373e-92d38bfb5719024adf4c72b086184d76
    (No such file or directory)/

    /at java.io.FileInputStream.open0(Native Method) ~[?:?]/

    /at java.io.FileInputStream.open(Unknown Source) ~[?:?]/

    /at java.io.FileInputStream.<init>(Unknown Source) ~[?:?]/

    /at
    
org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
    ~[flink-dist_2.11-1.13.5.jar:1.13.5]/

    /at
    
org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:134)
    ~[flink-dist_2.11-1.13.5.jar:1.13.5]/

    /at
    
org.apache.flink.runtime.blob.FileSystemBlobStore.get(FileSystemBlobStore.java:106)
    ~[flink-dist_2.11-1.13.5.jar:1.13.5]/

    /at
    
org.apache.flink.runtime.blob.FileSystemBlobStore.get(FileSystemBlobStore.java:88)
    ~[flink-dist_2.11-1.13.5.jar:1.13.5]/

    /at
    
org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:145)
    [flink-dist_2.11-1.13.5.jar:1.13.5]/

    /at
    
org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:187)
    [flink-dist_2.11-1.13.5.jar:1.13.5]/

    /at
    
org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.createUserCodeClassLoader(BlobLibraryCacheManager.java:251)
    [flink-dist_2.11-1.13.5.jar:1.13.5]/

    /at
    
org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.getOrResolveClassLoader(BlobLibraryCacheManager.java:228)
    [flink-dist_2.11-1.13.5.jar:1.13.5]/

    /at
    
org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.access$1100(BlobLibraryCacheManager.java:199)
    [flink-dist_2.11-1.13.5.jar:1.13.5]/

    /at
    
org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$DefaultClassLoaderLease.getOrResolveClassLoader(BlobLibraryCacheManager.java:333)
    [flink-dist_2.11-1.13.5.jar:1.13.5]/

    /at
    
org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:1024)
    [flink-dist_2.11-1.13.5.jar:1.13.5]/

    /at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:628)
    [flink-dist_2.11-1.13.5.jar:1.13.5]/

    /at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
    [flink-dist_2.11-1.13.5.jar:1.13.5]/

    /at java.lang.Thread.run(Unknown Source) [?:?]/

    /2022-02-17 07:19:45,408 INFO
    org.apache.flink.runtime.blob.BlobClient              [] -
    Downloading
    
0ddba6dd21053567981e11bda8da7c8e/p-4ff16c9e641ba8803e55d62a6ab2f6d05512373e-92d38bfb5719024adf4c72b086184d76
    from
    
10-1-10-213.noa-edge-infra-flink-jobmanager.noa-edge.svc.cluster.local/10.1.10.213:6124/

    //

    Thanks

    Noa

    *From: *Yun Gao <yungao...@aliyun.com> <mailto:yungao...@aliyun.com>
    *Date: *Thursday, 17 February 2022 at 14:04
    *To: *Koffman, Noa (Nokia - IL/Kfar Sava) <noa.koff...@nokia.com>
    <mailto:noa.koff...@nokia.com>, user <user@flink.apache.org>
    <mailto:user@flink.apache.org>
    *Subject: *Re: Task manager errors with Flink ZooKeeper High
    Availability

    Hi Koffman,

    From TM side the only possible usage come to me is that or
    components like BlobCache, which is used to

    transfer jars or large task informations between JM and TM. But
    specially for BlobService, if it failed to find

    the file it would turn to JM via http connection. If convenient
    could you also post the stack of the exception

    and may I have a double confirmation whether the job could still
    running normally with this exception?

    Sorry that I might miss something~

    Best,

    Yun

        ------------------Original Mail ------------------

        *Sender:*Koffman, Noa (Nokia - IL/Kfar Sava)
        <noa.koff...@nokia.com> <mailto:noa.koff...@nokia.com>

        *Send Date:*Thu Feb 17 05:00:42 2022

        *Recipients:*user <user@flink.apache.org>
        <mailto:user@flink.apache.org>

        *Subject:*Task manager errors with Flink ZooKeeper High
        Availability

            Hi,

            We are currently running flink in session deployment on
            k8s cluster, with 1 job-manager and 3 task-managers

            To support recovery from job-manager failure, following a
            different mail thread,

            We have enabled zookeeper high availability using a k8s
            Persistent Volume

            To achieve this, we’ve added these conf values:

            /    high-availability: zookeeper/

            /high-availability.zookeeper.quorum: zk-noa-edge-infra:2181/

            /high-availability.zookeeper.path.root: /flink/

            /high-availability.storageDir: /flink_state/

            /high-availability.jobmanager.port: 6150/

            for the storageDir, we are using a k8s persistent volume
            with ReadWriteOnce

            Recovery of job-manager failure is working now, but it
            looks like there are issues with the task-managers:

            The same configuration file is used in the task-managers
            as well, and there are a lot of error in the
            task-manager’s logs –

            java.io.FileNotFoundException:
            
/flink_state/flink/blob/job_9f4be579c7ab79817e25ed56762b7623/blob_p-5cf39313e388d9120c235528672fd267105be0e0-938e4347a98aa6166dc2625926fdab56
            (No such file or directory)

            It seems that the task-managers are trying to access the
            job-manager’s storage dir – can this be avoided?

            The task manager does not have access to the job manager
            persistent volume – is this mandatory?

            If we don’t have the option to use shared storage, is
            there a way to make zookeeper hold and manage the job
            states, instead of using the shared storage?

            Thanks

            Noa

Re: Task manager errors with Flink ZooKeeper High Availability

Reply via email to