Re: Understanding RocksDBStateBackend in Flink on Yarn on AWS EMR

2024-03-26 Thread Yang Wang
Usually, you should use the HDFS nameservice instead of the NameNode hostname:port to avoid NN failover. And you could find the supported nameservice in the hdfs-site.xml in the key *dfs.nameservices*. Best, Yang On Fri, Mar 22, 2024 at 8:33 PM Sachin Mittal wrote: > So, when we create an EMR

Re: Jobmanager restart after it has been requested to stop

2024-02-02 Thread Yang Wang
If you could find the "Deregistering Flink Kubernetes cluster, clusterId" in the JobManager log, then it is not the expected behavior. Having the full logs of JobManager Pod before restarted will help a lot. Best, Yang On Fri, Feb 2, 2024 at 1:26 PM Liting Liu (litiliu) via user < user@flink.a

Re: [DISCUSS] Hadoop 2 vs Hadoop 3 usage

2024-01-15 Thread Yang Wang
I could share some metrics about Alibaba Cloud EMR clusters. The ratio of Hadoop2 VS Hadoop3 is 1:3. Best, Yang On Thu, Dec 28, 2023 at 8:16 PM Martijn Visser wrote: > Hi all, > > I want to get some insights on how many users are still using Hadoop 2 > vs how many users are using Hadoop 3. Fli

Re: Flink HA with Zookeeper and Docker Compose: unable to startup a working setup.

2024-01-15 Thread Yang Wang
Could you please configure the same HA configurations for TaskManager as well? It seems that the TaskManager container does not use a correct URL when contacting with ResourceManager. Best, Yang On Fri, Dec 29, 2023 at 11:13 PM Alessio Bernesco Làvore < alessio.berne...@gmail.com> wrote: > Hell

Re: Deploying the K8S operator sample on GKE Autopilot : Association with remote system [akka.tcp://flink@basic-example.default:6123] has failed,

2024-01-15 Thread Yang Wang
Could you please directly use the JobManager Pod IP address instead of K8s service name(basic-example.default) and have a try with curl/wget? It seems that the JobManager K8s service could not be accessed. Best, Yang On Sat, Jan 13, 2024 at 1:24 AM LINZ, Arnaud wrote: > Hi, > > Some more tests

Re: Flink Kubernetes HA

2024-01-15 Thread Yang Wang
The fabric8 K8s client is using PATCH to replace get-and-update in v6.6.2. That's why you also need to give PATCH permission for the K8s service account. This would help to decrease the pressure of K8s APIServer. You could find more information here[1]. [1]. https://issues.apache.org/jira/browse/F

Re: Default Log4j properties in Native Kubernetes

2023-06-20 Thread Yang Wang
I assume you are using "*bin/flink run-application*" to submit a Flink application to K8s cluster. Then you could simply update your local log4j-console.properties, it will be shipped and mounted to JobManager/TaskManager pods via ConfigMap. Best, Yang Vladislav Keda 于2023年6月20日周二 22:15写道: > Hi

Re: "Error while retrieving the leader gateway" when using Kubernetes HA

2023-01-30 Thread Yang Wang
I assume you are using the standalone mode. Right? For the native K8s mode, the leader address should be *akka.tcp://flink@JM_POD_IP:6123/user/rpc/dispatcher_1 *when HA enabled. Best, Yang Anton Ippolitov via user 于2023年1月31日周二 00:21写道: > This is actually what I'm already doing, I'm only sett

Re: Apache Beam MinimalWordCount on Flink on Kubernetes using Flink Kubernetes Operator on GCP

2023-01-17 Thread Yang Wang
The "JAR file does not exist" exception comes from the JobManager side, not on the client. Please be aware that the local:// scheme in the jarURI means the path in the JobManager pod. You could use an init-container to download your user jar and mount it to the JobManager main-container. Refer to

Re: Supplying jar stored at S3 to flink to run the job in kubernetes

2023-01-16 Thread Yang Wang
Do you build your own flink-kubernetes-operator image with the flink-s3-fs plugin bundled[1]? [1]. https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.3/docs/custom-resource/overview/#flinksessionjob-spec-overview Best, Yang Weihua Hu 于2023年1月17日周二 10:47写道: > Hi, Rahul

Re: Flink Job Manager Recovery from EKS Node Terminations

2023-01-11 Thread Yang Wang
First, JobManager does not store any persistent data to local when the Kubernetes HA + S3 used. It means that you do not need to mount a PV for JobMananger deployment. Secondly, node failures or terminations should not cause the CrashLoopBackOff status. One possible reason I could imagine is a bug

Re: The use of zookeeper in flink

2023-01-03 Thread Yang Wang
The reason why the running jobs try to failover with zookeeper outage is that the JobManager lost leadership. Having a standby JobManager or not makes no difference. Best, Yang Matthias Pohl via user 于2023年1月2日周一 20:51写道: > And I screwed up the reply again. -.- Here's my previous response for t

Re: How to get failed streaming Flink job log in Flink Native K8s mode?

2023-01-03 Thread Yang Wang
I think you might need a sidecar container or daemonset to collect the Flink logs and store into a persistent storage. You could find more information here[1]. [1]. https://www.alibabacloud.com/blog/best-practices-of-kubernetes-log-collection_596356 Best, Yang hjw 于2022年12月22日周四 23:28写道: > On

Re: Stand alone K8s HA mode with Static Tokens Used by Service Accounts

2022-11-24 Thread Yang Wang
IIUC, the fabric8 Kubernetes-client 5.5.0 should already support to reload the latest kube config if received 401 error. Refer to the following PR[1] for more information. Please share your feedback here if it still could not work. [1]. https://github.com/fabric8io/kubernetes-client/pull/2731 Be

Re: support escaping `#` in flink job spec in Flink-operator

2022-11-08 Thread Yang Wang
This is a known limit of the current Flink options parser. Refer to FLINK-15358[1] for more information. [1]. https://issues.apache.org/jira/browse/FLINK-15358 Best, Yang Gyula Fóra 于2022年11月8日周二 14:41写道: > It is also possible that this is a problem of the Flink native Kubernetes > integration

Re: [DISCUSS ] add --jars to support users dependencies jars.

2022-10-27 Thread Yang Wang
Thanks Jacky Lau for starting this discussion. I understand that you are trying to find a convenient way to specify dependency jars along with user jar. However, let's try to narrow down by differentiating deployment modes. # Standalone mode No matter you are using the standalone mode on virtual

Re: configMap value error when using flink-operator?

2022-10-26 Thread Yang Wang
Maybe we could change the values of *taskmanager.numberOfTaskSlots* and *parallelism.default *in flink-conf.yaml of Kubernetes operator to 1, which are aligned with the default values in Flink codebase. Best, Yang Gyula Fóra 于2022年10月26日周三 15:17写道: > Hi! > > I agree that this might be confusin

Re: Flink Native K8S RBAC

2022-10-20 Thread Yang Wang
I have created a ticket[1] to fill the missing part in the native K8s documentation. [1]. https://issues.apache.org/jira/browse/FLINK-29705 Best, Yang Gyula Fóra 于2022年10月20日周四 13:37写道: > Hi! > > As a reference you can look at how the Flink Kubernetes Operator manages > RBAC settings: > > > ht

Re: Activate Flink HA without checkpoints on k8S

2022-10-19 Thread Yang Wang
Add some more information to Gyula's comment. For application mode without checkpoint, you do not need to activate the HA since it will not take any effect and the Flink job will be submitted again after the JobManager restarted. Because the job submission happens on the JobManager side. For sess

Re: fail to mount hadoop-config-volume when using flink-k8s-operator

2022-10-13 Thread Yang Wang
Currently, exporting the env "HADOOP_CONF_DIR" could only work for native K8s integration. The flink client will try to create the hadoop-config-volume automatically if hadoop env found. If you want to set the HADOOP_CONF_DIR in the docker image, please also make sure the specified hadoop conf dir

Re: serviceAccount permissions issue for high availability in operator 1.1

2022-09-20 Thread Yang Wang
ay to change that to use standalone K8s? I haven't > seen anything about that in the docs, besides a mention that standalone > support is coming in version 1.2 of the operator. > > Thanks, > > Javier > > > On Thu, Sep 8, 2022, 22:50 Yang Wang wrote: > >> Si

Re: serviceAccount permissions issue for high availability in operator 1.1

2022-09-08 Thread Yang Wang
Since the flink-kubernetes-operator is using native K8s integration[1] by default, you need to give the permissions of pod and deployment as well as ConfigMap. You could find more information about the RBAC here[2]. [1]. https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/r

Re: [Flink 1.15.1 - Application mode native k8s Exception] - Exception occurred while acquiring lock 'ConfigMapLock

2022-09-08 Thread Yang Wang
t should be a warning then > > What about the 1st error we encountered regarding the kube/config file > exception? > > > Thank you so much, > Best, > Tamir > > -- > *From:* Yang Wang > *Sent:* Thursday, September 8, 2022 7:08 AM >

Re: [Flink Kubernetes Operator] FlinkSessionJob crd spec jarURI

2022-09-08 Thread Yang Wang
Given that the "local://" schema means the jar is available in the image/container of JobManager, so it could only be supported in the K8s application mode. If you configure the jarURI to "file://" schema for session cluster, it means that this jar file should be available in the flink-kubernetes-

Re: Deploying Jobmanager on k8s as a Deployment

2022-09-07 Thread Yang Wang
For native K8s integration, the Flink ResourceManager will delete the JobManager K8s deployment as well as the HA data once the job reached a globally terminal state. However, it is indeed a problem for standalone mode since the JobManager will be restarted again even the job has finished. I think

Re: [Flink 1.15.1 - Application mode native k8s Exception] - Exception occurred while acquiring lock 'ConfigMapLock

2022-09-07 Thread Yang Wang
"data-agg-events-insertion-cluster-config-map" already > exists. > > Log file is enclosed. > > Thanks, > Tamir. > > -- > *From:* Yang Wang > *Sent:* Monday, September 5, 2022 3:03 PM > *To:* Tamir Sagi > *Cc:* user@flink.apache.org ; Lihi Peretz < &g

Re: [Flink 1.15.1 - Application mode native k8s Exception] - Exception occurred while acquiring lock 'ConfigMapLock

2022-09-05 Thread Yang Wang
Could you please check whether the "kubernetes.config.file" is configured to /opt/flink/.kube/config in the Flink configmap? It should be removed before creating the Flink configmap. Best, Yang Tamir Sagi 于2022年9月4日周日 18:08写道: > Hey All, > > We recently updated to Flink 1.15.1. We deploy stream

Re: How to open a Prometheus metrics port on the rest service when using the Kubernetes operator?

2022-09-05 Thread Yang Wang
I do not think we could add an additional port to the rest service since it is created by Flink internally. Actually, I do not suggest scrapping the metrics from rest service. Instead, the port in the pod needs to be used. Because the metrics might not work correctly if multiple JobManagers are ru

Re: [E] Re: Kubernetes operator expose UI rest service as NodePort instead of default clusterIP

2022-09-05 Thread Yang Wang
ed to "ip" might work. Let me try that. I >>> believe there should be a reason to always override the >>> "REST_SERVICE_EXPOSED_TYPE" to "ClusterIP". >>> >>> [1] https://docs.aws.amazon.com/eks/latest/userguide/alb-ingress.html >

Re: Kubernetes operator expose UI rest service as NodePort instead of default clusterIP

2022-09-01 Thread Yang Wang
I am afraid the current flink-kubernetes-operator always overrides the "REST_SERVICE_EXPOSED_TYPE" to "ClusterIP". Could you please share why the ingress[1] could not meet your requirements? Compared with NodePort, I think it is a more graceful implementation. [1]. https://nightlies.apache.org/fli

Re: Error when run test case in Windows

2022-08-22 Thread Yang Wang
It is caused by the following assert. Maybe we could *File.pathSeparator* instead of "/". *assertThat(optional.get()).isEqualTo(hadoopHome + "/conf");* Would you like to create a ticket and attach a PR for this issue? Best, Yang hjw <1010445...@qq.com> 于2022年8月21日周日 19:44写道: > When I run mvn c

Re: Flink Operator Resources Requests and Limits

2022-07-27 Thread Yang Wang
We have the *kubernetes.jobmanager.cpu.limit-factor* and *kubernetes.jobmanager.memory.limit-factor* to control the limit value. The resources limit memory will be set to memory/cpu * limit-factor. Best, Yang PACE, JAMES 于2022年7月28日周四 01:26写道: > That does not seem to work. > > > > For instanc

Re: NodePort conflict for multiple HA application-mode standalone Kubernetes deploys in same namespace

2022-07-24 Thread Yang Wang
Removing the nodePort for every different Flink application is necessary so that it could pick up a random port. Moreover, I believe you also need to change some other yamls. For example, having a different name for JobManager/TaskManager yamls, update the jobmanager-service.yaml and flink-configu

Re: [ANNOUNCE] Apache Flink Kubernetes Operator 1.1.0 released

2022-07-24 Thread Yang Wang
Congrats! Thanks Gyula for driving this release, and thanks to all contributors! Best, Yang Gyula Fóra 于2022年7月25日周一 10:44写道: > The Apache Flink community is very happy to announce the release of Apache > Flink Kubernetes Operator 1.1.0. > > The Flink Kubernetes Operator allows users to manage

Re: standalone mode support in the kubernetes operator (FLIP-25)

2022-07-18 Thread Yang Wang
ster? > What is the advantag doing so. > > Yang Wang 于2022年7月14日周四 10:55写道: > > > > I think the standalone mode support is expected to be done in the > version 1.2.0[1], which will be released on Oct 1 (ETA). > > > > [1]. > https://cwiki.apache.org/confluence/di

Re: standalone mode support in the kubernetes operator (FLIP-25)

2022-07-13 Thread Yang Wang
I think the standalone mode support is expected to be done in the version 1.2.0[1], which will be released on Oct 1 (ETA). [1]. https://cwiki.apache.org/confluence/display/FLINK/Release+Schedule+and+Planning Best, Yang Javier Vegas 于2022年7月14日周四 06:25写道: > Hello! The operator docs > https://n

Re: [ANNOUNCE] Apache Flink Kubernetes Operator 1.0.1 released

2022-06-28 Thread Yang Wang
Thanks Gyula for working on the first patch release for the Flink Kubernetes Operator project. Best, Yang Gyula Fóra 于2022年6月28日周二 00:22写道: > The Apache Flink community is very happy to announce the release of Apache > Flink Kubernetes Operator 1.0.1. > > The Flink Kubernetes Operator allows

Re: Flink k8s Operator on AWS?

2022-06-26 Thread Yang Wang
Could you please share the JobManager logs of failed deployment? It will also help a lot if you could show the pending pod status via "kubectl describe ". Given that the current Flink Kubernetes Operator is built on top of native K8s integration[1], the Flink ResourceManager should allocate enough

Re: Flink operator - ignore ssl cert validation

2022-06-23 Thread Yang Wang
Do you mean the HttpArtifactFetcher could not support HTTPS? cc @Aitozi Best, Yang calvin beloy 于2022年6月22日周三 22:10写道: > Sorry typo "jarring" should be "jar url". > > Sent from Yahoo Mail on Android >

Re: Flink Operator - Support for k8s HA jobmanager

2022-06-23 Thread Yang Wang
Matyas's answer is on the point. You need to mount a shared volume for all the JobManager pods so that the uploaded jars are visible for them all. Best, Yang Őrhidi Mátyás 于2022年6月23日周四 04:34写道: > I guess the problem here is that your JM pods do not have access to a > common upload folder. You

Re: HTTP 404 while creating resource with flink kubernetes operator and frabric8 client

2022-06-23 Thread Yang Wang
Do you have installed the operator along with CRD[1]? [1]. https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.0/docs/try-flink-kubernetes-operator/quick-start/#deploying-the-operator Best, Yang yu'an huang 于2022年6月23日周四 13:04写道: > Hi, > > It seems that you can't find t

Re: Flink k8s Operator on AWS?

2022-06-23 Thread Yang Wang
gt; just to get a few files into a pod container... I feel it's just a bit > much. > But again, this is my frustration with k8s, not with Flink ;-) > > Cheers, > Matt > > On Wed, Jun 22, 2022 at 5:32 AM Yang Wang wrote: > >> Matyas and Gyula have shared many great

Re: Flink k8s Operator on AWS?

2022-06-21 Thread Yang Wang
Matyas and Gyula have shared many great informations about how to make the Flink Kubernetes Operator work on the EKS. One more input about how to prepare the user jars. If you are more familiar with K8s, you could use persistent volume to provide the user jars and them mount the volume to JobManag

Re: Flink Kubernetes Operator with K8S + Istio + mTLS - port definitions

2022-06-16 Thread Yang Wang
Could you please have a try with high availability enabled[1]? If HA enabled, the internal jobmanager rpc service will not be created. Instead, the TaskManager retrieves the JobManager address via HA services and connects to it via pod ip. [1]. https://github.com/apache/flink-kubernetes-operator/

[ANNOUNCE] Apache Flink Kubernetes Operator 1.0.0 released

2022-06-05 Thread Yang Wang
The Apache Flink community is very happy to announce the release of Apache Flink Kubernetes Operator 1.0.0. The Flink Kubernetes Operator allows users to manage their Apache Flink applications and their lifecycle through native k8s tooling like kubectl. This is the first production ready release a

Re: Flink Kubernetes Operator v1.0 ETA

2022-06-01 Thread Yang Wang
If everything goes well, I will close the VOTE for RC4 on Friday night, which should run for more than 48 hours. And then finalize the release. Best, Yang Gyula Fóra 于2022年6月1日周三 23:30写道: > Hi Jeesmon! > > We are currently working through the release process. We are now in the > middle of votin

Re: multiple pipeline deployment using flink k8s operator

2022-06-01 Thread Yang Wang
The current application mode has the limitation that only one job could be submitted when HA enabled[1]. So a feasible solution is to use the session mode[2], it will be supported in the coming release-1.0.0. However, I am afraid it still could not satisfy your requirement "2 task managers (one pe

Re: Deployment on k8s via API

2022-05-17 Thread Yang Wang
Maybe you could have a try on the flink-kubernetes-operator[1]. It is designed for using Kubernetes CRD to manage the Flink applications. [1]. https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-0.1/ Best, Yang Devin Bost 于2022年5月18日周三 08:29写道: > Hi, > > I'm looking at m

Re: Running in application mode on YARN without fat jar

2022-05-16 Thread Yang Wang
The usrlib for YARN only works for 1.15.0 and later versions. Refer to the ticket[1] for more information. [1]. https://issues.apache.org/jira/browse/FLINK-24897 Best, Yang Pavel Penkov 于2022年5月16日周一 22:59写道: > I can't manage to run an application on YARN because of classpath issues. > Flink d

Re: Flink on Native K8s jobs turn in to `SUSPENDED` status unexpectedly.

2022-05-16 Thread Yang Wang
e were no more logs there. > > How can I reproduce the issue ? > > On Thu, May 12, 2022 at 10:35 AM Yang Wang wrote: > >> The SUSPENDED state is usually caused by lost leadership. Maybe you could >> find more information about leader in the JobManager and TaskManager logs.

Re: Flink on Native K8s jobs turn in to `SUSPENDED` status unexpectedly.

2022-05-11 Thread Yang Wang
The SUSPENDED state is usually caused by lost leadership. Maybe you could find more information about leader in the JobManager and TaskManager logs. Best, Yang Xiaolong Wang 于2022年5月11日周三 19:18写道: > Hello, > > Recently our Flink jobs on Native K8s encountered failing in the > `SUSPENDED` status

Re: Flink Kubernetes operator not having a scale subresource

2022-05-06 Thread Yang Wang
Currently, the flink-kubernetes-operator is using Flink native K8s integration[1], which means Flink ResourceManager will dynamically allocate TaskManager on demand. So the users do not need to specify the replicas of TaskManager. Just like Gyula said, one possible solution to make "kubectl scale"

Re: Using the official flink operator and kubernetes secrets

2022-05-04 Thread Yang Wang
--kafka.bootstrap.servers" >> - "my.kafka.host:9093" >> - "--kafka.sasl.username" >> - "$(KAFKA_SASL_USERNAME)" >> - "--kafka.sasl.password" >> - "$(KAFKA_SASL_PASSWORD)" >> ​ >> >> It would be a gr

Re: Using the official flink operator and kubernetes secrets

2022-05-03 Thread Yang Wang
Flink could not support environment replacement in the args. I think you could access the env via "*System.getenv()*" in the user main method. It should work since the user main method is executed in the JobManager side. Best, Yang Őrhidi Mátyás 于2022年4月28日周四 19:27写道: > Also, > > just declaring

Re: flink operator sometimes cannot start jobmanager after upgrading

2022-05-02 Thread Yang Wang
I am afraid we do not handle the scenario that the JobManager deployment is deleted externally. Best, Yang Őrhidi Mátyás 于2022年5月2日周一 16:52写道: > I filed a Jira for tracking this issue: > https://issues.apache.org/jira/browse/FLINK-27468 > > On Mon, May 2, 2022 at 10:31 AM Őrhidi Mátyás > wrote

Re: how to setup working dir in Flink operator

2022-04-25 Thread Yang Wang
Using the pod template to configure the local SSD(via host-path or local PV) is the correct way. After that, either "java.io.tmpdir" or "process.taskmanager.working-dir" in CR should take effect. Maybe you need to share the complete pod yaml and logs of failed TaskManager. nit: if the TaskManager

Re: JobManager doesn't bring up new TaskManager during failure recovery

2022-04-23 Thread Yang Wang
k this issue. > > > > BRs, > > Chenyu > > > > *From: *"Zheng, Chenyu" > *Date: *Friday, April 22, 2022 at 6:26 PM > *To: *Yang Wang > *Cc: *"user@flink.apache.org" , " > user...@flink.apache.org" > *Subject: *Re: JobManager d

Re: JobManager doesn't bring up new TaskManager during failure recovery

2022-04-22 Thread Yang Wang
The root cause might be you APIServer is overloaded or not running normally. And then all the pods events of taskmanager-1-9 and taskmanager-1-10 are not delivered to the watch in FlinkResourceManager. So the two taskmanagers are not recognized by ResourceManager and then registration are rejected.

Re: Kubernetes killing TaskManager - Flink ignoring taskmanager.memory.process.size

2022-04-21 Thread Yang Wang
Could you please configure a bigger memory to avoid OOM and use NMTracker[1] to figure out the memory usage categories? [1]. https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr007.html Best, Yang Dan Hill 于2022年4月21日周四 07:42写道: > Hi. > > I upgraded to Flink v1.14.4 an

Re: Enabling savepoints when deploying in Application Mode

2022-04-12 Thread Yang Wang
If you are trying to submit a job to an already-running application via "flink run", then it will not succeed. Because this is the by-design behavior. Please note that triggering a savepoint will also update the checkpoint information in HA ConfigMap, so deleting the deployment(with HA ConfigMap r

Re: Official Flink operator additional class paths

2022-04-07 Thread Yang Wang
It seems that you have a typo when specifying the pipeline classpath. "file:///flink-jar/flink-connector-rabbitmq_2.12-1.14.4.jar" -> "file:///flink-jars/flink-connector-rabbitmq_2.12-1.14.4.jar" If this is not the root cause, maybe you could have a try with downloading the connector jars to /opt/

Re: The flink-kubernetes-operator vs third party flink operators

2022-04-05 Thread Yang Wang
Thanks for the interest on the flink-kubernetes-operator project. I believe you could leave a comment in the ticket FLINK-27049. If the reporter has not start working on this ticket, then you could be assigned. Best, Yang Hao t Chang 于2022年4月6日周三 06:30写道: > Hi Gyula > > > > Thanks for the reply

Re: flink cluster startup time

2022-03-30 Thread Yang Wang
@Gyula Fóra is trying to prepare the preview release(0.1) for flink-kubernetes-operator. It now is fully functional for application mode. You could have a try and share more feedback with the community. The release-1.0 aims for production ready. And we still miss some important pieces(e.g. FlinkS

Re: "Native Kubernetes" sample in Flink documentation fails. JobManager Web Interface is wrongly generated. [Flink 1.14.4]

2022-03-28 Thread Yang Wang
ubernetes.config.file=/home/devuser/.kube/config > examples/batch/WordCount.jar > > > > Best regards, > > Burcu > > > > *From:* Yang Wang [mailto:danrtsey...@gmail.com] > *Sent:* Saturday, March 26, 2022 7:48 AM > *To:* Burcu Gul POLAT EGRI > *Cc:* user@flink.a

Re: JobManager failed to renew it's leadership (K8S HA)

2022-03-27 Thread Yang Wang
Could you please verify whether the JobManager is going through a long full GC or the Kubernetes APIServer is working well at that moment? We are using Kubernetes HA service in the production and it seems stable without your issue. Best, Yang marco andreas 于2022年3月27日周日 18:35写道: > > Hello, >

Re: Deploy a Flink session cluster natively on K8s with multi AZ

2022-03-27 Thread Yang Wang
> In the example, we can pass args in the command, is there a way to do it by using the flink-conf.yaml? Yes. All the changes in the $FLINK_HOME/conf/flink-conf.yaml at the client side will also be picked up when deploying a native K8s cluster. For your use case, I am also suggesting the flink-ku

Re: flink-kubernetes-operator: Flink deployment stuck in scheduled state when increasing resource CPU above 1

2022-03-25 Thread Yang Wang
Could you please share the result of "kubectl describe pods" when getting stuck? It will be very useful to help to figure out the root cause. I guess it might be related to insufficient resources for minikube. Best, Yang Őrhidi Mátyás 于2022年3月26日周六 03:12写道: > It's worth checking the deploymen

Re: "Native Kubernetes" sample in Flink documentation fails. JobManager Web Interface is wrongly generated. [Flink 1.14.4]

2022-03-25 Thread Yang Wang
The root cause might be the LoadBalancer could not really work in your environment. We already have a ticket to track this[1] and will try to get it resolved in the next release. For now, could you please have a try by adding "-Dkubernetes.rest-service.exposed.type=NodePort" to your session and su

Re: Kubernetes HA on an application cluster

2022-03-21 Thread Yang Wang
This log means the Flink internal leader elector failed to renew the leader ConfigMap to keep its leadership. It might be caused by a network issue, long fullGC or the K8s APIServer internal error. This blog[1] could help you to know how the Kubernetes HA works. [1]. https://flink.apache.org/2021

Re: Submit job to a session cluster on Kubernetes via REST API

2022-03-06 Thread Yang Wang
If you want to use the RestClusterClient to do the job submission and lifecycle management, the implementation in the flink-kubernetes-operator[1] project may give you some insights. You could also use /jars/:jarid/run[2] to run a Flink job. It is a pure HTTP interface. [1]. https://github.com/a

Re: [Flink-1.14.3] Restart of pod due to duplicatejob submission

2022-02-24 Thread Yang Wang
This might be related with FLINK-21928 and seems already fixed in 1.14.0. But it will have some limitations and users need to manually clean up the HA entries. Best, Yang Parag Somani 于2022年2月24日周四 13:42写道: > Hello, > > Recently due to log4j vulnerabilities, we have upgraded to Apache Flink >

Re: No effect from --allowNonRestoredState or "execution.savepoint.ignore-unclaimed-state" in K8S application mode

2022-02-22 Thread Yang Wang
last resort but didn't > think to do it together with "--fromSavepoint". > > Thanks again! > > On Sun, Feb 20, 2022 at 9:49 PM Yang Wang wrote: > >> By design, we should support arbitrary config keys via the CLI when using >> generic CLI mode. >&g

Re: No effect from --allowNonRestoredState or "execution.savepoint.ignore-unclaimed-state" in K8S application mode

2022-02-20 Thread Yang Wang
le=true \ > > The only one i'm having problems with is > "execution.savepoint.ignore-unclaimed-state". > > On Fri, Feb 18, 2022 at 3:42 PM Austin Cawley-Edwards < > austin.caw...@gmail.com> wrote: > >> Hi Andrey, >> >> It's unclear to me from

Re: Flink 1.14.2 - Log4j2 -Dlog4j.configurationFile is ignored and falls back to default /opt/flink/conf/log4j-console.properties

2022-01-24 Thread Yang Wang
ge prior cluster creation; all logs' files are there. > once the cluster is deployed, they are missing. (bug?) > > Best, > Tamir. > -- > *From:* Tamir Sagi > *Sent:* Friday, January 21, 2022 7:19 PM > *To:* Yang Wang > *Cc:* user@flink.apa

Re: Flink 1.14.2 - Log4j2 -Dlog4j.configurationFile is ignored and falls back to default /opt/flink/conf/log4j-console.properties

2022-01-20 Thread Yang Wang
ePathList > "$FLINK_TM_CLASSPATH:$INTERNAL_HADOOP_CLASSPATHS"`" ${CLASS_TO_RUN} > "${ARGS[@]}" "${FLINK_ENV_JAVA_OPTS}" > > Unless there are system configurations which not supposed to be overridden > by user(And then having dedicated env variables

Re: Flink 1.14.2 - Log4j2 -Dlog4j.configurationFile is ignored and falls back to default /opt/flink/conf/log4j-console.properties

2022-01-17 Thread Yang Wang
lse, then the logging properties are not added to start > command(Might be the case, which explains why it works in 1.12.2) > > It then passes 'jobmanager' as component. > looking into /docker-entrypoint.sh it calls jobmanager.sh which calls > flink-console.sh internally > > Have I missed anythin

Re: [E] Re: Orphaned job files in HDFS

2022-01-17 Thread Yang Wang
: > Ok, that makes sense. I did see some job failures. However failures > could happen occasionally. Is there any option to have the job manager > clean-up these directories when the job has failed? > > On Mon, Jan 10, 2022 at 8:58 PM Yang Wang wrote: > >> IIRC, the stagi

Re: Flink 1.14.2 - Log4j2 -Dlog4j.configurationFile is ignored and falls back to default /opt/flink/conf/log4j-console.properties

2022-01-17 Thread Yang Wang
I think the root cause is that we are using "flink-console.sh" to start the JobManager/TaskManager process for native K8s integration after FLINK-21128[1]. So it forces the log4j configuration name to be "log4j-console.properties". [1]. https://issues.apache.org/jira/browse/FLINK-21128 Best, Ya

Re: Flink native k8s integration vs. operator

2022-01-17 Thread Yang Wang
gt;> >> Flink >>> >> > to make it easy for everyone to build fitting higher level >>> abstractions >>> >> > like a FlinkApplication Custom Resource on top of it. For example, >>> we >>> >> are >>> >> &

Re: Orphaned job files in HDFS

2022-01-10 Thread Yang Wang
IIRC, the staging directory(/user/{name}/.flink/application_xxx) will be deleted automatically if the Flink job reaches global terminal state(e.g. FINISHED, CANCELED, FAILED). So I assume you have stopped the yarn application via "yarn application -kill", not via "bin/flink cancel". If it is the ca

Re: Unable to update logback configuration in Flink Native Kubernetes

2022-01-09 Thread Yang Wang
Sorry for the late reply. Flink clients will ship the log4j-console.properties and logback-console.xml via K8s ConfigMap and then mount to JobManager/TaskManager pod. So if you want to update the log settings or using logback, all you need is to update the client-local files. Best, Yang Ragha

Re: Pod Disruption in Flink Kubernetes Cluster

2022-01-09 Thread Yang Wang
Maybe the Flink applications could run more stably if you configure enough resources(e.g. memory, cpu, ephemeral-storage) for the JobManager and TaskManager pods. Best, Yang David Morávek 于2022年1月5日周三 16:46写道: > Hi Tianyi, > > this really depends on your kubernetes setup (eg. if autoscaling is

Re: Flink native k8s integration vs. operator

2022-01-09 Thread Yang Wang
Thanks all for this fruitful discussion. I think Xintong has given a strong point why we introduced the native K8s integration, which is active resource management. I have a concrete example for this in the production. When a K8s node is down, the standalone K8s deployment will take longer recover

Re: Flink On Native K8s hostAliases in Pod-template

2021-12-23 Thread Yang Wang
Hi, The pod template file when you submit a Flink application via "flink run-application ... -Dkubernetes.pod-template-file=/path/of/pod-template.yaml" is a *client-local* file. You do not need to bundle it into the docker image. Best, Yang 黄剑文 于2021年12月23日周四 23:00写道: > Flink version:1.13 > I

Re: Flink fails to load class from configured classpath using PipelineOptions

2021-12-20 Thread Yang Wang
the StreamExecutionEnvironment, and I am not sure how that can > be done. > > Pouria > > > On Thu, Dec 16, 2021 at 8:46 PM Yang Wang wrote: > >> The config option "pipeline.jars" is used to specify the user jar, which >> contains the main class. >> I thin

Re: Flink fails to load class from configured classpath using PipelineOptions

2021-12-16 Thread Yang Wang
The config option "pipeline.jars" is used to specify the user jar, which contains the main class. I think what you need is "pipeline.classpaths". /** * A list of URLs that are added to the classpath of each user code classloader of the program. * Paths must specify a protocol (e.g. file://) and

Re: Flink 1.13.3, k8s HA - ResourceManager was revoked leadership

2021-12-15 Thread Yang Wang
Could you please check whether the JobManager has a long fullGC, which will cause the leadership lost? BTW, increasing the timeout should help. high-availability.kubernetes.leader-election.lease-duration: 60s high-availability.kubernetes.leader-election.renew-deadline: 60s Best, Yang Alexey Tr

Re: [DISCUSS] Drop Zookeeper 3.4

2021-12-11 Thread Yang Wang
After FLINK-10052[1], which was merged in 1.14.0, rolling upgrading ZooKeeper will not affect the running Flink application. 1. https://issues.apache.org/jira/browse/FLINK-10052 Best, Yang Chesnay Schepler 于2021年12月7日周二 下午4:37写道: > Since this is only relevant for 1.15, if you intend to migrat

Re: [DISCUSS] Drop Zookeeper 3.4

2021-12-06 Thread Yang Wang
FYI: We(Alibaba) are widely using ZooKeeper 3.5.5 for all the YARN and some K8s Flink high-available applications. Best, Yang Chesnay Schepler 于2021年12月7日周二 上午2:22写道: > Current users of ZK 3.4 and below would need to upgrade their Zookeeper > installation that is used by Flink to 3.5+. > > Wh

Re: Kubernetes HA: New jobs stuck in Initializing for a long time after a certain number of existing jobs are running

2021-11-22 Thread Yang Wang
I believe this issue[1] is related and has been fixed in 1.13.0 and 1.12.3. [1]. https://issues.apache.org/jira/browse/FLINK-22006 Best, Yang Matthias Pohl 于2021年11月22日周一 下午9:19写道: > Hi Joey, > that looks like a cluster configuration issue. The 192.168.100.79:6123 is > not accessible from th

Re: Could not retrieve submitted JobGraph from state handle

2021-11-16 Thread Yang Wang
Hi Alexey, If you delete the HA data stored in the S3 manually or maybe you configured an automatic clean-up rule, then it could happen that the ConfigMap has the pointers while the concrete data in the S3 is missing. > How to clean the state handle store? Since the handle is stored in the Confi

Re: Fabric8 does not support EC keys

2021-11-16 Thread Yang Wang
rnetes/client/Config.java#L79) > but none seems to work :( > > I'm launching flink with this: /flink-1.14.0/bin/flink run-application ... > > Thanks! > > On Mon, Nov 15, 2021 at 4:08 AM Yang Wang wrote: > >> It seems that "EC"[1] is already supported in Kube

Re: Fabric8 does not support EC keys

2021-11-14 Thread Yang Wang
It seems that "EC"[1] is already supported in Kubernetes client v5.5.0. However, the default value is "RSA". Could you please export the following environment first and have a try again? export KUBERNETES_CLIENT_KEY_ALGO_SYSTEM_PROPERTY=EC [1]. https://github.com/fabric8io/kubernetes-client/blob/

Re: Getting Errors in Standby Jobmanager pod during installation & after restart on k8s

2021-10-27 Thread Yang Wang
Roman's answer is on the point. The exception is really confusing and it comes from fabric8 kubernetes-client. We might try to create a PR for the upstream project :) Best, Yang Roman Khachatryan 于2021年10月25日周一 下午10:00写道: > Hi Amit, > > AFAIK, these exceptions are normal in HA mode as differen

Re: Not cleanup Kubernetes Configmaps after execution success

2021-10-27 Thread Yang Wang
Hi, I think Roman is right. It seems that the JobManager is relaunched again by K8s after Flink has already deregister the application(aka delete the JobManager K8s deployment). One possible reason might be that kubelet is too late to know the JobManager deployment is deleted. So it relaunch the

Re: Why we need again kubernetes flink operator?

2021-10-25 Thread Yang Wang
Hi Bhaskar, IIUC, flink-k8s-operator and Flink native K8s mode are orthogonal. They do not mean to replace other one. The flink-k8s-operator is more like a Flink lifecycle management tool. It could make deploying a Flink application on K8s easier. We just need to apply a CR yaml and is more frien

Re: High availability data clean up

2021-10-25 Thread Yang Wang
Hi Weiqing, > Why does Flink not set the owner reference of HA related ConfigMaps to JobManager deployment? It is easier to clean up for users. The major reason is that simply deleting the HA related ConfigMaps will make the HA data located in DFS leak. > How to delete the HA ConfigMap from exter

Re: Not cleanup Kubernetes Configmaps after execution success

2021-10-25 Thread Yang Wang
Hi Hua Wei, I think you need to share the JobManager logs so that we could check whether Flink had tried to clean up the HA related ConfigMaps. Using the "kubectl logs -f >/tmp/log.jm" could help with dumping the logs. Best, Yang Roman Khachatryan 于2021年10月25日周一 下午5:35写道: > Hi Hua, > > It lo

Re: Kubernetes HA - Reusing storage dir for different clusters

2021-10-08 Thread Yang Wang
right? That's a somewhat non-standard scenario, so I wouldn't > expect Flink to clean up, I just want to be sure. > > Regards, > Alexis. > > ------ > *From:* Yang Wang > *Sent:* Friday, October 8, 2021 5:24 AM > *To:* Alexis Sarda-Espino

Re: Start Flink cluster, k8s pod behavior

2021-10-08 Thread Yang Wang
Did you use the "jobmanager.sh start-foreground" in your own "run-job-manager.sh", just like what Flink has done in the docker-entrypoint.sh[1]? I strongly suggest to start the Flink session cluster with official yamls[2]. [1]. https://github.com/apache/flink-docker/blob/master/1.13/scala_2.11-ja

  1   2   3   4   5   >