Hello Community,

I am checking the reactive mode for Flink deployment. I noticed that this is 
supported in Kubernetes environment, but only for standalone Kubernetes as of 
now. I have read some previous discussion threads regarding this issue. See 
[1][2][3][4][5][6].

Question 1:
It seems that due to some interface and design considerations [4] mentioned by 
Robert and Xintong and official doc[5], this feature is only for standalone k8s 
and it is not available for native Kubernetes now. However, I believe in 
theory, it is possible to be added to native Kubernetes, right? Will this be 
part of the future plan? If not, what is the restriction and is it a hard 
restriction?

Question 2:
I have built an native Kubernetes operator on top of Yang’s work [7] supporting 
various state transfers in native k8s application mode and session mode. Right 
now, I am seeking for adding some similar features like reactive scaling for 
native k8s. From my perspective, what I can do is to enable periodic savepoints 
and scale up/down based certain metrics we collect inside the Flink 
application. Some additional resource considerations need to be added to 
implement such feature, similar to the adaptive scheduler concept in [9][10] (I 
didn’t dive deep into that, I guess I just need to calculated the new TMs will 
be offered with sufficient k8s resources if the rescale happens?)
I think as a user/operator, I am not supposed by to be able to 
recover/restarted a job from checkpoint [8].
I guess this might cause some performance loss since savepoints are more 
expensive and the Flink application must do both savepoint and checkpoint 
periodically… Is there any possible ways that user can also use checkpoints to 
restart and recover as a user? If Question 1 will be part of the future plan, I 
guess I won’t need much work here.

Reference:
[1] Reactive mode blog: https://flink.apache.org/2021/05/06/reactive-mode.html
[2] example usage of reactive scaling: 
https://github.com/rmetzger/flink-reactive-mode-k8s-demo
[3] FILP: 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-159%3A+Reactive+Mode
[4] Discussion thread: 
https://lists.apache.org/thread.html/ra688faf9dca036500f0445c55671e70ba96c70f942afe650e9db8374%40%3Cdev.flink.apache.org%3E
[5] Flink doc: 
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/elastic_scaling/
[6] Flink Jira: 
https://issues.apache.org/jira/browse/FLINK-10407\<https://issues.apache.org/jira/browse/FLINK-10407/>
[7] https://github.com/wangyang0918/flink-native-k8s-operator
[8] 
https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/state/checkpoints/#difference-to-savepoints
[9] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-138%3A+Declarative+Resource+management
[10] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler

Thanks,
Fuyao

Reply via email to