Hi amenreet,

Maybe you can try using HDFS or S3 for `high-availability.storageDir`; I
noticed your current job is using a local path, one starting with
`file:///`.
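
For instance, a minimal sketch of the relevant lines in flink-conf.yaml,
assuming a hypothetical bucket named `my-flink-ha` and an S3 filesystem
plugin (e.g. flink-s3-fs-hadoop) available on the classpath:

    # HA metadata must live on storage every JobManager pod can reach
    high-availability.storageDir: s3://my-flink-ha/flink/ha
    # static credentials, only if not supplied via IAM (placeholder values)
    s3.access-key: <your-access-key>
    s3.secret-key: <your-secret-key>

Any shared filesystem works here; the point is that a pod-local `file:///`
directory may not exist (or not persist) when a fresh JobManager pod starts,
which matches the JobResultStore error in your stack trace.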

Best,
Shammon FY


On Fri, Jul 7, 2023 at 4:20 PM amenreet sodhi <amenso...@gmail.com> wrote:

> Hi All,
> I am deploying a Flink cluster on Kubernetes in HA mode. However, I noticed
> that whenever I deploy the Flink cluster for the first time on a K8s cluster,
> it fails to populate the cluster ConfigMap, due to which the JM fails with the
> following exception:
>
> 2023-07-06 16:46:11,428 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Fatal error occurred in the cluster entrypoint.
> java.util.concurrent.CompletionException: java.lang.IllegalStateException: The base directory of the JobResultStore isn't accessible. No dirty JobResults can be restored.
>       at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) ~[?:?]
>       at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) [?:?]
>       at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1702) [?:?]
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
>       at java.lang.Thread.run(Thread.java:834) [?:?]
> Caused by: java.lang.IllegalStateException: The base directory of the JobResultStore isn't accessible. No dirty JobResults can be restored.
>       at org.apache.flink.util.Preconditions.checkState(Preconditions.java:193) ~[event_executor-1.1.20.jar:?]
>       at org.apache.flink.runtime.highavailability.FileSystemJobResultStore.getDirtyResultsInternal(FileSystemJobResultStore.java:182) ~[event_executor-1.1.20.jar:?]
>       at org.apache.flink.runtime.highavailability.AbstractThreadsafeJobResultStore.withReadLock(AbstractThreadsafeJobResultStore.java:118) ~[event_executor-1.1.20.jar:?]
>       at org.apache.flink.runtime.highavailability.AbstractThreadsafeJobResultStore.getDirtyResults(AbstractThreadsafeJobResultStore.java:100) ~[event_executor-1.1.20.jar:?]
>       at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.getDirtyJobResults(SessionDispatcherLeaderProcess.java:194) ~[event_executor-1.1.20.jar:?]
>       at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[event_executor-1.1.20.jar:?]
>       at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.getDirtyJobResultsIfRunning(SessionDispatcherLeaderProcess.java:188) ~[event_executor-1.1.20.jar:?]
>       at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) ~[?:?]
>
> Once we reinstall or run a helm upgrade, this exception goes away. How can
> this be resolved? Is any additional configuration required?
>
> I am using the following configuration for HA:
>
> high-availability.storageDir: file:///opt/flink/pm/ha
> kubernetes.cluster-id: {{ include "fullname" . }}-cluster-{{ now | date "20060102150405" }}
> high-availability.jobmanager.port: 6123
> high-availability.type: kubernetes
> high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
> kubernetes.namespace: {{ .Release.Namespace }}
>
> Thanks
>
> Regards
> Amenreet Singh Sodhi
>
>
