I'm adding the Flink user ML to the conversation again. On Mon, Feb 15, 2021 at 8:18 AM Matthias Pohl <matth...@ververica.com> wrote:
> Hi Omer,
> thanks for sharing the configuration. You're right: using NFS for HA's
> storageDir is fine.
>
> About the error message you're referring to: I haven't worked with the HA
> k8s service yet, but the RBAC is a good hint. Flink's native Kubernetes
> documentation [1] points out that you can use a custom service account.
> This one needs special permissions to start/stop pods automatically (which
> does not apply in your case) but also to access ConfigMaps. You might want
> to try setting the permissions as described in [1].
>
> Best,
> Matthias
>
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/resource-providers/native_kubernetes.html#rbac
>
> On Sun, Feb 14, 2021 at 7:16 PM Omer Ozery <omeroz...@gmail.com> wrote:
>
>> Hey Matthias,
>> my name is Omer, I am Daniel's DevOps engineer; I'll elaborate on our
>> Flink situation.
>> These are our Flink resource definitions, as generated by the
>> helm template command (minus the log4j and metrics configuration and some
>> sensitive data):
>>
>> ---
>> # Source: flink/templates/flink-configmap.yaml
>> apiVersion: v1
>> kind: ConfigMap
>> metadata:
>>   name: flink-config
>>   labels:
>>     app: flink
>> data:
>>   flink-conf.yaml: |
>>     jobmanager.rpc.address: flink-jobmanager
>>     jobmanager.rpc.port: 6123
>>     jobmanager.execution.failover-strategy: region
>>     jobmanager.memory.process.size: 8g
>>     taskmanager.memory.process.size: 24g
>>     taskmanager.memory.task.off-heap.size: 1g
>>     taskmanager.numberOfTaskSlots: 4
>>     queryable-state.proxy.ports: 6125
>>     queryable-state.enable: true
>>     blob.server.port: 6124
>>     parallelism.default: 1
>>     state.backend.incremental: true
>>     state.backend: rocksdb
>>     state.backend.rocksdb.localdir: /opt/flink/rocksdb
>>     state.checkpoints.dir: file:///opt/flink/checkpoints
>>     classloader.resolve-order: child-first
>>     kubernetes.cluster-id: flink-cluster
>>     kubernetes.namespace: intel360-beta
>>     high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
>>     high-availability.storageDir: file:///opt/flink/recovery
>>
>> ---
>> # Source: flink/templates/flink-service.yaml
>> apiVersion: v1
>> kind: Service
>> metadata:
>>   name: flink-jobmanager
>>   labels: {}
>> spec:
>>   ports:
>>   - name: http-ui
>>     port: 8081
>>     targetPort: http-ui
>>   - name: tcp-rpc
>>     port: 6123
>>     targetPort: tcp-rpc
>>   - name: tcp-blob
>>     port: 6124
>>     targetPort: tcp-blob
>>   selector:
>>     app: flink
>>     component: jobmanager
>> ---
>> # Source: flink/templates/flink-deployment.yaml
>> apiVersion: apps/v1
>> kind: Deployment
>> metadata:
>>   name: flink-jobmanager
>> spec:
>>   replicas: 1
>>   selector:
>>     matchLabels:
>>       app: flink
>>       component: jobmanager
>>   template:
>>     metadata:
>>       labels:
>>         app: flink
>>         component: jobmanager
>>       annotations:
>>         checksum/config: f22b7a7c8a618b2f5257476737883s6a76a88c25b37e455257e120d8e4ed941
>>     spec:
>>       containers:
>>       - name: jobmanager
>>         image: flink:1.12.1-scala_2.11-java11
>>         args: [ "jobmanager" ]
>>         ports:
>>         - name: http-ui
>>           containerPort: 8081
>>         - name: tcp-rpc
>>           containerPort: 6123
>>         - name: tcp-blob
>>           containerPort: 6124
>>         resources: {}
>>         # Environment Variables
>>         env:
>>         - name: ENABLE_CHECKPOINTING
>>           value: "true"
>>         - name: JOB_MANAGER_RPC_ADDRESS
>>           value: "flink-jobmanager"
>>         volumeMounts:
>>         - name: flink-config
>>           mountPath: /opt/flink/conf/flink-conf.yaml
>>           subPath: flink-conf.yaml
>>         # NFS mounts
>>         - name: flink-checkpoints
>>           mountPath: "/opt/flink/checkpoints"
>>         - name: flink-recovery
>>           mountPath: "/opt/flink/recovery"
>>       volumes:
>>       - name: flink-config
>>         configMap:
>>           name: flink-config
>>       # NFS volumes
>>       - name: flink-checkpoints
>>         nfs:
>>           server: "my-nfs-server.my-org"
>>           path: "/my-shared-nfs-dir/flink/checkpoints"
>>       - name: flink-recovery
>>         nfs:
>>           server: "my-nfs-server.my-org"
>>           path: "/my-shared-nfs-dir/flink/recovery"
>> ---
>> # Source: flink/templates/flink-deployment.yaml
>> apiVersion: apps/v1
>> kind: Deployment
>> metadata:
>>   name: flink-taskmanager
>> spec:
>>   replicas: 7
>>   selector:
>>     matchLabels:
>>       app: flink
>>       component: taskmanager
>>   template:
>>     metadata:
>>       labels:
>>         app: flink
>>         component: taskmanager
>>       annotations:
>>         checksum/config: f22b7a7c8a618b2f5257476737883s6a76a88c25b37e455257e120d8e4ed941
>>     spec:
>>       containers:
>>       - name: taskmanager
>>         image: flink:1.12.1-scala_2.11-java11
>>         args: [ "taskmanager" ]
>>         resources:
>>           limits:
>>             cpu: 6000m
>>             memory: 24Gi
>>           requests:
>>             cpu: 6000m
>>             memory: 24Gi
>>         # Environment Variables
>>         env:
>>         - name: ENABLE_CHECKPOINTING
>>           value: "true"
>>         - name: JOB_MANAGER_RPC_ADDRESS
>>           value: "flink-jobmanager"
>>         volumeMounts:
>>         - name: flink-config
>>           mountPath: /opt/flink/conf/flink-conf.yaml
>>           subPath: flink-conf.yaml
>>         # NFS mounts
>>         - name: flink-checkpoints
>>           mountPath: "/opt/flink/checkpoints"
>>         - name: flink-recovery
>>           mountPath: "/opt/flink/recovery"
>>       volumes:
>>       - name: flink-config
>>         configMap:
>>           name: flink-config
>>       # NFS volumes
>>       - name: flink-checkpoints
>>         nfs:
>>           server: "my-nfs-server.my-org"
>>           path: "/my-shared-nfs-dir/flink/checkpoints"
>>       - name: flink-recovery
>>         nfs:
>>           server: "my-nfs-server.my-org"
>>           path: "/my-shared-nfs-dir/flink/recovery"
>> ---
>> # Source: flink/templates/flink-ingress.yaml
>> apiVersion: extensions/v1beta1
>> kind: Ingress
>> metadata:
>>   name: jobmanager
>> spec:
>>   rules:
>>   - host: my.flink.job.manager.url
>>     http:
>>       paths:
>>       - path: /
>>         backend:
>>           serviceName: flink-jobmanager
>>           servicePort: 8081
>> ---
>>
>> As you can see, we are using the skeleton of the standalone configuration
>> as it is documented here:
>> https://ci.apache.org/projects/flink/flink-docs-stable/deployment/resource-providers/standalone/kubernetes.html
>> with some per-company configuration, obviously, but still within the scope
>> of this document.
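[Editor's note on Matthias's RBAC hint above: the ConfigMap permissions could be granted with a namespaced Role and RoleBinding along these lines. This is only a sketch under assumptions — the object names are invented, and the verb list is a guess at what Flink's Kubernetes HA leader election needs (it creates, reads, and updates ConfigMaps); the namespace and the "default" service account are taken from the error messages later in this thread.]

```yaml
# Sketch only: grant the 'default' service account in intel360-beta
# access to ConfigMaps. Names and verb list are assumptions, not taken
# from the Flink docs -- check [1] in Matthias's reply for the real list.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: flink-configmap-access    # hypothetical name
  namespace: intel360-beta
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: flink-configmap-access    # hypothetical name
  namespace: intel360-beta
subjects:
- kind: ServiceAccount
  name: default                   # the account named in the errors below
  namespace: intel360-beta
roleRef:
  kind: Role
  name: flink-configmap-access
  apiGroup: rbac.authorization.k8s.io
```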
>>
>> On a normal, beautiful day, without the HA configuration, everything
>> works fine.
>> When we try to configure Kubernetes HA using this document:
>> https://ci.apache.org/projects/flink/flink-docs-stable/deployment/ha/kubernetes_ha.html
>> with the following parameters:
>>
>> kubernetes.cluster-id: flink-cluster
>> high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
>> high-availability.storageDir: file:///opt/flink/recovery
>>
>> the jobmanager fails with the following error:
>>
>> 2021-02-14 16:57:19,103 ERROR io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector [] -
>> Exception occurred while acquiring lock 'ConfigMapLock: default -
>> flink-cluster-restserver-leader (54211907-eba9-47b1-813e-11f12ba89ccb)'
>> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at:
>> https://10.96.0.1/api/v1/namespaces/default/configmaps/flink-cluster-restserver-leader.
>> Message: Forbidden!Configured service account doesn't have access. Service
>> account may have been revoked. configmaps "flink-cluster-restserver-leader"
>> is forbidden: User "system:serviceaccount:intel360-beta:default" cannot get
>> resource "configmaps" in API group "" in the namespace "default".
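[Editor's note: the error names the "default" service account because the Deployments above never set one. If a dedicated service account were created and bound to a Role granting ConfigMap access, the pod templates of both Deployments would reference it roughly like this — the account name here is a made-up example, not something from this thread:]

```yaml
# Hypothetical fragment of the jobmanager/taskmanager Deployment pod spec:
# run the pods under a dedicated service account instead of 'default'.
spec:
  template:
    spec:
      serviceAccountName: flink-service-account  # example name; must be bound to a Role with ConfigMap access
      # ... rest of the pod spec (containers, volumes) unchanged
```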
>>
>> So we added this line as well (as you can see in the flink-config
>> ConfigMap above):
>> kubernetes.namespace: intel360-beta
>> although it is not part of the document, and I don't think Flink should be
>> aware of the namespace it resides in; it damages the modularity of the
>> upper layers of configuration. Regardless, we added it and then got the
>> following error:
>>
>> 2021-02-14 17:00:57,086 ERROR io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector [] -
>> Exception occurred while acquiring lock 'ConfigMapLock: intel360-beta -
>> flink-cluster-restserver-leader (66180ce6-c62e-4ea4-9420-0a3134bef3d6)'
>> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at:
>> https://10.96.0.1/api/v1/namespaces/intel360-beta/configmaps/flink-cluster-restserver-leader.
>> Message: Forbidden!Configured service account doesn't have access. Service
>> account may have been revoked. configmaps "flink-cluster-restserver-leader"
>> is forbidden: User "system:serviceaccount:intel360-beta:default" cannot get
>> resource "configmaps" in API group "" in the namespace "intel360-beta".
>>
>> which is basically the same error message, just directed at Flink's own
>> namespace.
>> My question is: do I need to add RBAC permissions to Flink's service
>> account? I got the impression from the official Flink documents and some
>> blog responses that it is designed to function without any special
>> permissions.
>> If we do need RBAC, can you give an official documentation reference for
>> the exact permissions?
>>
>> NOTE: as you can see, our flink-checkpoints and flink-recovery locations
>> point to a local directory mounted on an NFS share common to all task
>> managers and the job manager, since our infrastructure is bare-metal by
>> design
>> (although this one is hosted in AWS).
>>
>> Thanks in advance,
>> Omer
>>
>>
>> ---------- Forwarded message ---------
>> From: Daniel Peled <daniel.peled.w...@gmail.com>
>> Date: Sun, Feb 14, 2021 at 6:18 PM
>> Subject: Fwd: Flink's Kubernetes HA services - NOT working
>> To: <omeroz...@gmail.com>
>>
>>
>> ---------- Forwarded message ---------
>> From: Matthias Pohl <matth...@ververica.com>
>> Date: Thu, Feb 11, 2021 at 5:37 PM
>> Subject: Re: Flink's Kubernetes HA services - NOT working
>> To: Matthias Pohl <matth...@ververica.com>
>> Cc: Daniel Peled <daniel.peled.w...@gmail.com>, user <user@flink.apache.org>
>>
>> One other thing: it looks like you've set high-availability.storageDir to
>> a local path, file:///opt/flink/recovery. You should use a storage path
>> that is accessible from all Flink cluster components (e.g. using S3). Only
>> references are stored in the Kubernetes ConfigMaps [1].
>>
>> Best,
>> Matthias
>>
>> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/ha/kubernetes_ha.html#configuration
>>
>> On Wed, Feb 10, 2021 at 6:08 PM Matthias Pohl <matth...@ververica.com> wrote:
>>
>>> Hi Daniel,
>>> what's the exact configuration you used? Did you use the resource
>>> definitions provided in the Standalone Flink on Kubernetes docs [1]? Did
>>> you do certain things differently in comparison to the documentation?
>>>
>>> Best,
>>> Matthias
>>>
>>> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/resource-providers/standalone/kubernetes.html#appendix
>>>
>>> On Wed, Feb 10, 2021 at 1:31 PM Daniel Peled <daniel.peled.w...@gmail.com> wrote:
>>>
>>>> Hey,
>>>>
>>>> We are using standalone Flink on Kubernetes,
>>>> and we have followed the instructions in the following link,
>>>> "Kubernetes HA Services":
>>>> https://ci.apache.org/projects/flink/flink-docs-stable/deployment/ha/kubernetes_ha.html
>>>> We were unable to make it work.
>>>> We are facing a lot of problems.
>>>> For example, some of the jobs don't start, complaining that there are
>>>> not enough slots available, although there are enough slots and it seems
>>>> as if the job manager is NOT aware of all the task managers.
>>>> In another scenario we were unable to run any job at all.
>>>> The Flink dashboard is unresponsive and we get the error:
>>>> "flink service temporarily unavailable due to an ongoing leader
>>>> election. please refresh"
>>>> We believe we are missing some configuration.
>>>> Are there any more detailed instructions?
>>>> Any suggestions/tips?
>>>> Attached is the log of the job manager in one of the attempts.
>>>>
>>>> Please give me some advice.
>>>> BR,
>>>> Danny