Re: Task manager errors with Flink ZooKeeper High Availability

Koffman, Noa (Nokia - IL/Kfar Sava) Thu, 17 Feb 2022 05:35:54 -0800

Thanks,
I understand that the functionality isn’t affected, which is very good news.
But is there a way to either skip this check or skip logging it? We see it in 
our log more than 400 times per task manager.
It would be very helpful if the log level could be reduced or the check could 
be skipped. Is there any way to achieve this?


Thanks
Noa
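
On quieting this message: one possible approach, sketched here under 
assumptions rather than confirmed by anyone in this thread, is to raise the 
level of the logger that emits it. The trace quoted below shows the message 
logged at INFO by org.apache.flink.runtime.blob.PermanentBlobCache, and Flink 
1.13 is configured through a Log4j 2 properties file (conf/log4j.properties). 
The logger id "blobcache" is an arbitrary label chosen for this example.

```properties
# Assumption: added to conf/log4j.properties on the task managers.
# "blobcache" is a hypothetical id; only the .name value must match the logger.
logger.blobcache.name = org.apache.flink.runtime.blob.PermanentBlobCache
logger.blobcache.level = WARN
```

Note that this would hide every INFO message from that class, not only the 
"Failed to copy from blob store" one, and it does not skip the check itself.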

From: Chesnay Schepler <ches...@apache.org>
Date: Thursday, 17 February 2022 at 15:00
To: Koffman, Noa (Nokia - IL/Kfar Sava) <noa.koff...@nokia.com>, Yun Gao 
<yungao...@aliyun.com>, user <user@flink.apache.org>
Subject: Re: Task manager errors with Flink ZooKeeper High Availability
Everything is fine.

The TM tries to retrieve the jar (aka, the blob), and there is a fast path to 
access it directly from storage. This fails (because it has no access to it), 
and then falls back to retrieving it from the JM.
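
The fast-path-then-fallback behaviour described above can be sketched roughly 
as follows. This is an illustration only, not Flink's code: BlobFetchSketch, 
BlobSource and getBlob are hypothetical names, and the byte array is 
placeholder data. It shows why the logged exception is harmless: only a 
failure of the fallback is a real error.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Illustrative sketch only; these names are hypothetical, not Flink's API. */
class BlobFetchSketch {

    /** Hypothetical stand-in for either the HA storage directory or the JM's BLOB server. */
    interface BlobSource {
        byte[] fetch(String blobKey) throws IOException;
    }

    /**
     * Try the fast path (direct read from storage) first; if it fails for any
     * I/O reason, fall back to the slower path (download from the JM).
     */
    static byte[] getBlob(String blobKey, BlobSource storage, BlobSource jobManager) {
        try {
            return storage.fetch(blobKey);
        } catch (IOException fastPathFailure) {
            // Informational: the fast path is optional, so this is not an error.
            System.out.println("Failed to copy from blob store ("
                    + fastPathFailure.getMessage()
                    + "). Downloading from BLOB server instead.");
        }
        try {
            return jobManager.fetch(blobKey);
        } catch (IOException fallbackFailure) {
            // Only a failure of the fallback is a real problem.
            throw new UncheckedIOException("Could not retrieve blob " + blobKey, fallbackFailure);
        }
    }

    public static void main(String[] args) {
        // The TM cannot see the JM's volume, so the direct read always fails here.
        BlobSource unreachableStorage = key -> {
            throw new FileNotFoundException(key + " (No such file or directory)");
        };
        // The JM serves the blob; the bytes are placeholder data.
        BlobSource jobManager = key -> new byte[] {1, 2, 3};

        byte[] jar = getBlob("example-blob-key", unreachableStorage, jobManager);
        System.out.println("Retrieved " + jar.length + " bytes via the fallback path");
    }
}
```

In this sketch the caller never sees the fast-path failure; it surfaces only 
as a log line, which matches the INFO message and stack trace quoted below.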

On 17/02/2022 13:49, Koffman, Noa (Nokia - IL/Kfar Sava) wrote:
Hi,
Thanks for your reply,
Please see below the full stack trace and the log message right after it. It 
looks like it is trying to download via BlobClient after failing to download 
from the store, as you suggested.
My question is: is there a way to avoid this attempt to copy from the blob store? 
Is my configuration of the task manager wrong?
Currently we are using the same flink-conf.yaml file for both the job manager and 
the task managers, which includes the high-availability configuration mentioned 
below. Should these settings be removed from the task managers?


2022-02-17 07:19:45,408 INFO  org.apache.flink.runtime.blob.PermanentBlobCache [] - Failed to copy from blob store. Downloading from BLOB server instead.
java.io.FileNotFoundException: /flink_state/default/blob/job_0ddba6dd21053567981e11bda8da7c8e/blob_p-4ff16c9e641ba8803e55d62a6ab2f6d05512373e-92d38bfb5719024adf4c72b086184d76 (No such file or directory)
    at java.io.FileInputStream.open0(Native Method) ~[?:?]
    at java.io.FileInputStream.open(Unknown Source) ~[?:?]
    at java.io.FileInputStream.<init>(Unknown Source) ~[?:?]
    at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
    at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:134) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
    at org.apache.flink.runtime.blob.FileSystemBlobStore.get(FileSystemBlobStore.java:106) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
    at org.apache.flink.runtime.blob.FileSystemBlobStore.get(FileSystemBlobStore.java:88) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
    at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:145) [flink-dist_2.11-1.13.5.jar:1.13.5]
    at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:187) [flink-dist_2.11-1.13.5.jar:1.13.5]
    at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.createUserCodeClassLoader(BlobLibraryCacheManager.java:251) [flink-dist_2.11-1.13.5.jar:1.13.5]
    at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.getOrResolveClassLoader(BlobLibraryCacheManager.java:228) [flink-dist_2.11-1.13.5.jar:1.13.5]
    at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.access$1100(BlobLibraryCacheManager.java:199) [flink-dist_2.11-1.13.5.jar:1.13.5]
    at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$DefaultClassLoaderLease.getOrResolveClassLoader(BlobLibraryCacheManager.java:333) [flink-dist_2.11-1.13.5.jar:1.13.5]
    at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:1024) [flink-dist_2.11-1.13.5.jar:1.13.5]
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:628) [flink-dist_2.11-1.13.5.jar:1.13.5]
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566) [flink-dist_2.11-1.13.5.jar:1.13.5]
    at java.lang.Thread.run(Unknown Source) [?:?]
2022-02-17 07:19:45,408 INFO  org.apache.flink.runtime.blob.BlobClient [] - Downloading 0ddba6dd21053567981e11bda8da7c8e/p-4ff16c9e641ba8803e55d62a6ab2f6d05512373e-92d38bfb5719024adf4c72b086184d76 from 10-1-10-213.noa-edge-infra-flink-jobmanager.noa-edge.svc.cluster.local/10.1.10.213:6124


Thanks
Noa


From: Yun Gao <yungao...@aliyun.com>
Date: Thursday, 17 February 2022 at 14:04
To: Koffman, Noa (Nokia - IL/Kfar Sava) <noa.koff...@nokia.com>, user 
<user@flink.apache.org>
Subject: Re: Task manager errors with Flink ZooKeeper High Availability
Hi Koffman,

From the TM side, the only usage that comes to mind is by components like 
BlobCache, which is used to transfer jars or large task information between 
the JM and TM. But specifically for the BlobService, if it fails to find the 
file, it falls back to the JM via an http connection. If convenient, could you 
also post the stack trace of the exception, and could you confirm whether the 
job still runs normally despite this exception?

Sorry if I have missed something.

Best,
Yun



------------------Original Mail ------------------
Sender:Koffman, Noa (Nokia - IL/Kfar Sava) <noa.koff...@nokia.com>
Send Date:Thu Feb 17 05:00:42 2022
Recipients:user <user@flink.apache.org>
Subject:Task manager errors with Flink ZooKeeper High Availability

Hi,
We are currently running Flink in a session deployment on a k8s cluster, with 1 
job-manager and 3 task-managers.
To support recovery from job-manager failure, following a different mail thread, 
we have enabled ZooKeeper high availability using a k8s Persistent Volume.

To achieve this, we’ve added these conf values:
    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk-noa-edge-infra:2181
    high-availability.zookeeper.path.root: /flink
    high-availability.storageDir: /flink_state
    high-availability.jobmanager.port: 6150
For the storageDir, we are using a k8s persistent volume with the ReadWriteOnce access mode.

Recovery from job-manager failure is working now, but it looks like there are 
issues with the task-managers.
The same configuration file is used in the task-managers as well, and there are 
a lot of errors in the task-managers' logs:
java.io.FileNotFoundException: /flink_state/flink/blob/job_9f4be579c7ab79817e25ed56762b7623/blob_p-5cf39313e388d9120c235528672fd267105be0e0-938e4347a98aa6166dc2625926fdab56 (No such file or directory)

It seems that the task-managers are trying to access the job-manager’s storage 
dir – can this be avoided?
The task manager does not have access to the job manager persistent volume – is 
this mandatory?
If we don’t have the option to use shared storage, is there a way to make 
zookeeper hold and manage the job states, instead of using the shared storage?

Thanks
Noa
