Hi Fuyao, About your second question: You are right that taking and restoring from savepoints will incur a performance loss. They cannot be incremental, and cannot use native (low-level) data formats - for now. These issues are on the list of things to improve for Flink 1.15, so if the changes make it into the release, it may improve a lot. You can restore a job from a retained checkpoint (provided you configured retained checkpoints, else they are deleted on job cancellation), see [1] (right below the part you linked). It should be possible to rescale using a retained checkpoint, despite the docs suggesting otherwise (it was uncertain whether this guarantee should/can be given, so it was not stated in the docs. This is also expected to change in the future as it is a necessity for further reactive mode development).
Best, Nico [1] https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/state/checkpoints/#resuming-from-a-retained-checkpoint On Wed, Oct 27, 2021 at 11:56 PM Fuyao Li <fuyao...@oracle.com> wrote: > Hello Community, > > > > I am checking the reactive mode for Flink deployment. I noticed that this > is supported in Kubernetes environment, but only for standalone Kubernetes > as of now. I have read some previous discussion threads regarding this > issue. See [1][2][3][4][5][6]. > > > > Question 1: > > It seems that due to some interface and design considerations [4] > mentioned by Robert and Xintong and official doc[5], this feature is only > for standalone k8s and it is not available for native Kubernetes now. > However, I believe in theory, it is possible to be added to native > Kubernetes, right? Will this be part of the future plan? If not, what is > the restriction and is it a hard restriction? > > > > Question 2: > > I have built an native Kubernetes operator on top of Yang’s work [7] > supporting various state transfers in native k8s application mode and > session mode. Right now, I am seeking for adding some similar features like > reactive scaling for native k8s. From my perspective, what I can do is to > enable periodic savepoints and scale up/down based certain metrics we > collect inside the Flink application. Some additional resource > considerations need to be added to implement such feature, similar to the > adaptive scheduler concept in [9][10] (I didn’t dive deep into that, I > guess I just need to calculated the new TMs will be offered with sufficient > k8s resources if the rescale happens?) > > I think as a user/operator, I am not supposed by to be able to > recover/restarted a job from checkpoint [8]. > > I guess this might cause some performance loss since savepoints are more > expensive and the Flink application must do both savepoint and checkpoint > periodically… Is there any possible ways that user can also use checkpoints > to restart and recover as a user? If Question 1 will be part of the future > plan, I guess I won’t need much work here. > > > > Reference: > > [1] Reactive mode blog: > https://flink.apache.org/2021/05/06/reactive-mode.html > > [2] example usage of reactive scaling: > https://github.com/rmetzger/flink-reactive-mode-k8s-demo > > [3] FILP: > https://cwiki.apache.org/confluence/display/FLINK/FLIP-159%3A+Reactive+Mode > > [4] Discussion thread: > https://lists.apache.org/thread.html/ra688faf9dca036500f0445c55671e70ba96c70f942afe650e9db8374%40%3Cdev.flink.apache.org%3E > > [5] Flink doc: > https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/elastic_scaling/ > > [6] Flink Jira: https://issues.apache.org/jira/browse/FLINK-10407\ > <https://issues.apache.org/jira/browse/FLINK-10407/> > > [7] https://github.com/wangyang0918/flink-native-k8s-operator > > [8] > https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/state/checkpoints/#difference-to-savepoints > > [9] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-138%3A+Declarative+Resource+management > > [10] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler > > > > Thanks, > > Fuyao >