Hello Community,

I am looking into reactive mode for Flink deployments. I noticed that it is supported on Kubernetes, but only for standalone Kubernetes as of now. I have read the previous discussion threads and documents on this topic; see [1][2][3][4][5][6].
Question 1: It seems that, due to the interface and design considerations mentioned by Robert and Xintong [4] and in the official doc [5], this feature is only available for standalone k8s and not for native Kubernetes at the moment. However, I believe that in theory it could be added to native Kubernetes, right? Will this be part of the future plan? If not, what is the restriction, and is it a hard restriction?

Question 2: I have built a native Kubernetes operator on top of Yang’s work [7] that supports various state transitions in native k8s application mode and session mode. Right now, I am looking to add features similar to reactive scaling for native k8s. From my perspective, what I can do is enable periodic savepoints and scale up/down based on certain metrics we collect inside the Flink application. Some additional resource considerations need to be added to implement such a feature, similar to the adaptive scheduler concept in [9][10]. (I didn’t dive deep into that; I guess I just need to check that the new TMs can be given sufficient k8s resources if the rescale happens?)

I think that, as a user/operator, I am not supposed to recover/restart a job from a checkpoint [8]. I guess this might cause some performance loss, since savepoints are more expensive and the Flink application must take both savepoints and checkpoints periodically. Is there any way for a user to restart and recover from checkpoints as well?

If Question 1 will be part of the future plan, I guess I won’t need much work here.
References:
[1] Reactive mode blog: https://flink.apache.org/2021/05/06/reactive-mode.html
[2] Example usage of reactive scaling: https://github.com/rmetzger/flink-reactive-mode-k8s-demo
[3] FLIP: https://cwiki.apache.org/confluence/display/FLINK/FLIP-159%3A+Reactive+Mode
[4] Discussion thread: https://lists.apache.org/thread.html/ra688faf9dca036500f0445c55671e70ba96c70f942afe650e9db8374%40%3Cdev.flink.apache.org%3E
[5] Flink doc: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/elastic_scaling/
[6] Flink Jira: https://issues.apache.org/jira/browse/FLINK-10407
[7] https://github.com/wangyang0918/flink-native-k8s-operator
[8] https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/state/checkpoints/#difference-to-savepoints
[9] https://cwiki.apache.org/confluence/display/FLINK/FLIP-138%3A+Declarative+Resource+management
[10] https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler

Thanks,
Fuyao