Hi, I am new to the Flink Kubernetes Operator. We are planning to adopt it at my company, and it would be great if someone could help answer my questions.

1. The Kubernetes operator appears to be coupled with the autoscaler. The operator manages the lifecycle of Flink jobs in a Kubernetes cluster, and the autoscaler scales those jobs based on the catch-up time and busy time configurations. I am trying to understand why the two are coupled. This coupling could lead to a lack of failure isolation (an autoscaler failure affecting deployments, and vice versa), prevent scaling the operator independently of the autoscaler, and tie together the deployment of two otherwise independent components. Am I understanding this correctly, or am I missing something?

2. Pluggability of auto-scaling policies: Currently the auto-scaling policies are not pluggable, i.e., apart from job deployments, only one scaling logic gets executed as part of the reconciliation loop. Would it be acceptable if we developed support for pluggable auto-scaling policies in the operator and contributed it upstream?

3. Metrics storage: The metrics that the Kubernetes autoscaler uses are stored in a ConfigMap. There is a 1 MB limit on the size of a ConfigMap in our Kubernetes cluster; wouldn't this be a bottleneck? Even though the metrics storage is pluggable, I would like to understand the rationale for making this the default option in the operator.

4. In-place rescaling is only supported in native mode (k8s mode) and not in the standalone autoscaler mode. Would it be okay if we developed this and contributed it upstream?

5. The operator does not seem to support the creation of Flink session clusters. We have SQL use cases with Jupyter notebooks (for testing purposes) for which this might be necessary. Would it be possible for us to add this support to the operator and contribute it upstream?

Most of these questions come from my experience of playing with the operator locally using the Helm chart and the sample deployment YAMLs, so they may not be accurate.
I am happy to stand corrected. Thanks.