Hi Community, Yang,

I am new to Flink on native Kubernetes and am trying to build a POC of the native Kubernetes application mode on Oracle Cloud Infrastructure. I followed the documentation step by step: [1]
I am using Flink 1.12.1, Scala 2.11, and Java 11. I was able to create a native Kubernetes deployment, but I cannot run any further commands such as list or cancel; they always fail with a timeout error. I think the cause could be that the JobManager Web Interface IP address printed after job deployment is not accessible. Because of this, I cannot shut down the deployment with a savepoint. It could be a Kubernetes configuration issue. I have opened traffic on all related ports and validated the security list, but still couldn’t make it work. Any help is appreciated.

The relevant Flink source code is the CliFrontend.java class [2]. The ./bin/flink list and cancel commands try to send traffic to the Flink dashboard UI IP address and time out. I tried both the LoadBalancer and NodePort options for the -Dkubernetes.rest-service.exposed.type configuration. Neither of them works.

# List running jobs on the cluster (I can’t execute this command successfully due to the timeout; logs are shared below)
$ ./bin/flink list --target kubernetes-application -Dkubernetes.cluster-id=my-first-application-cluster

# Cancel a running job (I can’t execute this command successfully either)
$ ./bin/flink cancel --target kubernetes-application -Dkubernetes.cluster-id=my-first-application-cluster <jobId>

I think these commands need to communicate with the endpoint that is printed after the job submission command.

Use case 1 (deploy with NodePort)

# fuyli @ fuyli-mac in ~/Development/flink-1.12.1 [17:59:00] C:127
$ ./bin/flink run-application \
    --target kubernetes-application \
    -Dkubernetes.cluster-id=my-first-application-cluster \
    -Dkubernetes.container.image=us-phoenix-1.ocir.io/idxglh0bz964/flink-demo:21.3.1 \
    -Dkubernetes.container.image.pull-policy=IfNotPresent \
    -Dkubernetes.container.image.pull-secrets=ocirsecret \
    -Dkubernetes.rest-service.exposed.type=NodePort \
    -Dkubernetes.service-account=flink-service-account \
    local:///opt/flink/usrlib/quickstart-0.1.jar

When the exposed type is NodePort, the printed message says the Flink JobManager Web Interface is at http://192.29.104.156:30996. 192.29.104.156 is my Kubernetes API server address, and 30996 is the port that exposes the service. However, the Flink dashboard is not reachable at this address. I can only access the dashboard UI on each node's IP address (there are three nodes in my K8s cluster):

100.104.154.73:30996
100.104.154.74:30996
100.104.154.75:30996

I got the errors in [4] when trying to run the list command against this native Kubernetes deployment. According to the documentation [3], this shouldn’t happen, since the Flink Web UI should also be available at the Kubernetes API server address. Did I miss any Kubernetes configuration needed to make the Web UI available at the Kubernetes API server address?

Use case 2 (deploy with LoadBalancer)

# fuyli @ fuyli-mac in ~/Development/flink-1.12.1 [17:59:00] C:127
$ ./bin/flink run-application \
    --target kubernetes-application \
    -Dkubernetes.cluster-id=my-first-application-cluster \
    -Dkubernetes.container.image=us-phoenix-1.ocir.io/idxglh0bz964/flink-demo:21.3.1 \
    -Dkubernetes.container.image.pull-policy=IfNotPresent \
    -Dkubernetes.container.image.pull-secrets=ocirsecret \
    -Dkubernetes.rest-service.exposed.type=LoadBalancer \
    -Dkubernetes.service-account=flink-service-account \
    local:///opt/flink/usrlib/quickstart-0.1.jar

After a while, the external IP was resolved.
The CLI output then reported that the Flink JobManager web interface is at the external IP (the load balancer address): http://144.25.13.78:8081

When I execute the list command, I still get an error after waiting a long time for it to time out. See the errors in [5]. I can still access the dashboard at NodeIP:<service-port>. Given that, I tend to believe it is a network issue, but I am still quite confused, since I have already opened all the traffic.

References:
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/resource-providers/native_kubernetes.html
[2] https://github.com/apache/flink/blob/f3155e6c0213de7bf4b58a89fb1e1331dee7701a/flink-clients/src/main/java/org/apache/flink/client/cli/CliFrontend.java
[3] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/resource-providers/native_kubernetes.html#accessing-flinks-web-ui
[4] https://pastebin.ubuntu.com/p/WcJMwds52r/
[5] https://pastebin.ubuntu.com/p/m27BnQGXQc/

Thanks in advance for your help.

Best regards,
Fuyao