Hi All,
I am deploying a Flink cluster on Kubernetes in HA mode. I noticed that
whenever I deploy the Flink cluster for the first time on a K8s cluster, it
fails to populate the cluster ConfigMap, and the JobManager fails with the
following exception:

2023-07-06 16:46:11,428 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
java.util.concurrent.CompletionException: java.lang.IllegalStateException: The base directory of the JobResultStore isn't accessible. No dirty JobResults can be restored.
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) ~[?:?]
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) [?:?]
        at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1702) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.lang.IllegalStateException: The base directory of the JobResultStore isn't accessible. No dirty JobResults can be restored.
        at org.apache.flink.util.Preconditions.checkState(Preconditions.java:193) ~[event_executor-1.1.20.jar:?]
        at org.apache.flink.runtime.highavailability.FileSystemJobResultStore.getDirtyResultsInternal(FileSystemJobResultStore.java:182) ~[event_executor-1.1.20.jar:?]
        at org.apache.flink.runtime.highavailability.AbstractThreadsafeJobResultStore.withReadLock(AbstractThreadsafeJobResultStore.java:118) ~[event_executor-1.1.20.jar:?]
        at org.apache.flink.runtime.highavailability.AbstractThreadsafeJobResultStore.getDirtyResults(AbstractThreadsafeJobResultStore.java:100) ~[event_executor-1.1.20.jar:?]
        at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.getDirtyJobResults(SessionDispatcherLeaderProcess.java:194) ~[event_executor-1.1.20.jar:?]
        at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[event_executor-1.1.20.jar:?]
        at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.getDirtyJobResultsIfRunning(SessionDispatcherLeaderProcess.java:188) ~[event_executor-1.1.20.jar:?]
        at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) ~[?:?]

Once we reinstall or run `helm upgrade`, the exception goes away. How can
this be resolved? Is any additional configuration required?

I am using the following configuration for HA:

    high-availability.storageDir: file:///opt/flink/pm/ha
    kubernetes.cluster-id: {{ include "fullname" . }}-cluster-{{ now | date "20060102150405" }}
    high-availability.jobmanager.port: 6123
    high-availability.type: kubernetes
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    kubernetes.namespace: {{ .Release.Namespace }}
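
For reference, this is roughly what the template above renders to, assuming a hypothetical release whose `fullname` resolves to `my-flink` in namespace `flink` (both values and the timestamp are illustrative, not from my actual deployment). Note that `now` is re-evaluated on every Helm render, so each install or upgrade produces a different `kubernetes.cluster-id`:

```yaml
# Rendered flink-conf.yaml fragment (hypothetical values)
high-availability.storageDir: file:///opt/flink/pm/ha
kubernetes.cluster-id: my-flink-cluster-20230706164611   # timestamp changes on every render
high-availability.jobmanager.port: 6123
high-availability.type: kubernetes
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
kubernetes.namespace: flink
```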

Thanks

Regards
Amenreet Singh Sodhi
