Hi everyone, just wanted to mention that my employer agreed to open source the PoC I developed: https://github.com/MicroFocus/opsb-flink-k8s-operator
I understand the concern for maintainability, so Gradle & Kotlin might not be appealing to you, but at least it gives you another reference. The Helm resources in particular might be useful. There are bits and pieces there referring to Flink sessions, but those are just placeholders; the functioning parts use application mode with native integration.

Regards,
Alexis.

________________________________
From: Thomas Weise <t...@apache.org>
Sent: Saturday, February 5, 2022 2:41 AM
To: dev <dev@flink.apache.org>
Subject: Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Hi,

Thanks for the continued feedback and discussion. Looks like we are ready to start a VOTE, I will initiate it shortly.

In parallel it would be good to find the repository name. My suggestion would be: flink-kubernetes-operator

I thought "flink-operator" could be a bit misleading since the term operator already has a meaning in Flink.

I also considered "flink-k8s-operator" but that would be almost identical to existing operator implementations and could lead to confusion in the future.

Thoughts?

Thanks,
Thomas

On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra <gyula.f...@gmail.com> wrote:
>
> Hi Danny,
>
> So far we have been focusing our dev efforts on the initial native
> implementation with the team.
> If the discussion and vote go well for this FLIP we are looking forward
> to contributing the initial version sometime next week (fingers crossed).
>
> At that point I think we can already start the dev work to support the
> standalone mode as well, especially if you can dedicate some effort to
> pushing that side.
> Working together on this sounds like a great idea and we should start as
> soon as possible! :)
>
> Cheers,
> Gyula
>
> On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer <dannycran...@apache.org>
> wrote:
>
> > I have been discussing this one with my team. We are interested in the
> > Standalone mode, and are willing to contribute towards the implementation.
> > Potentially we can work together to support both modes in parallel?
> >
> > Thanks,
> >
> > On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
> >
> > > Hi Danny!
> > >
> > > Thanks for the feedback :)
> > >
> > > Versioning:
> > > Versioning will be independent from Flink and the operator will depend
> > > on a fixed Flink version (in every given operator version).
> > > This should be the exact same setup as with Stateful Functions
> > > (https://github.com/apache/flink-statefun). So independent release
> > > cycle but still within the Flink umbrella.
> > >
> > > Deployment error handling:
> > > I think that's a very good point, as general exception handling for the
> > > different failure scenarios is a tricky problem. I think the exception
> > > classifiers and retry strategies could avoid a lot of manual
> > > intervention from the user. We will definitely need to add something
> > > like this. Once we have the repo created with the initial operator code
> > > we should open some tickets for this and put it on the short term
> > > roadmap!
> > >
> > > Cheers,
> > > Gyula
> > >
> > > On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer <dannycran...@apache.org>
> > > wrote:
> > >
> > > > Hey team,
> > > >
> > > > Great work on the FLIP, I am looking forward to this one. I agree
> > > > that we can move forward to the voting stage.
> > > >
> > > > I have general feedback around how we will handle job submission
> > > > failure and retry. As discussed in the Rejected Alternatives section,
> > > > we can use Java to handle job submission failures from the Flink
> > > > client. It would be useful to have the ability to configure exception
> > > > classifiers and retry strategy as part of operator configuration.
> > > >
> > > > Given this will be in a separate GitHub repository I am curious how
> > > > the versioning strategy will work in relation to the Flink version?
> > > > Do we have any other components with a similar setup I can look at?
> > > > Will the operator version track Flink or will it use its own
> > > > versioning strategy with a Flink version support matrix, or similar?
> > > >
> > > > Thanks,
> > > >
> > > > On Tue, Feb 1, 2022 at 2:33 PM Márton Balassi
> > > > <balassi.mar...@gmail.com> wrote:
> > > >
> > > > > Hi team,
> > > > >
> > > > > Thank you for the great feedback, Thomas has updated the FLIP page
> > > > > accordingly. If you are comfortable with the currently existing
> > > > > design and depth in the FLIP [1] I suggest moving forward to the
> > > > > voting stage - once that reaches a positive conclusion it lets us
> > > > > create the separate code repository under the Flink project for the
> > > > > operator.
> > > > >
> > > > > I encourage everyone to keep improving the details in the meantime,
> > > > > however I believe given the existing design and the general
> > > > > sentiment on this thread that the most efficient path from here is
> > > > > starting the implementation so that we can collectively iterate
> > > > > over it.
> > > > >
> > > > > [1]
> > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > > >
> > > > > On Mon, Jan 31, 2022 at 10:15 PM Thomas Weise <t...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > Hi Xintong,
> > > > > >
> > > > > > Thanks for the feedback and please see responses below -->
> > > > > >
> > > > > > On Fri, Jan 28, 2022 at 12:21 AM Xintong Song
> > > > > > <tonysong...@gmail.com> wrote:
> > > > > >
> > > > > > > Thanks Thomas for drafting this FLIP, and everyone for the
> > > > > > > discussion.
> > > > > > >
> > > > > > > I also have a few questions and comments.
> > > > > > >
> > > > > > > ## Job Submission
> > > > > > > Deploying a Flink session cluster via kubectl & CR and then
> > > > > > > submitting jobs to the cluster via Flink cli / REST is probably
> > > > > > > the approach that requires the least effort. However, I'd like
> > > > > > > to point out 2 weaknesses.
> > > > > > > 1. A lot of users use Flink in perjob/application modes. For
> > > > > > > these users, having to run the job in two steps (deploy the
> > > > > > > cluster, and submit the job) is not that convenient.
> > > > > > > 2. One of our motivations is being able to manage Flink
> > > > > > > applications' lifecycles with kubectl. Submitting jobs from the
> > > > > > > cli does not sound aligned with this motivation.
> > > > > > > I think it's probably worth it to support submitting jobs via
> > > > > > > kubectl & CR in the first version, both together with deploying
> > > > > > > the cluster like in perjob/application mode and after deploying
> > > > > > > the cluster like in session mode.
> > > > > > >
> > > > > >
> > > > > > The intention is to support application management through
> > > > > > operator and CR, which means there won't be any 2 step submission
> > > > > > process, which as you allude to would defeat the purpose of this
> > > > > > project. The CR example shows the application part. Please note
> > > > > > that the bare cluster support is an *additional* feature for
> > > > > > scenarios that require external job management. Is there anything
> > > > > > on the FLIP page that creates a different impression?
> > > > > >
> > > > > > >
> > > > > > > ## Versioning
> > > > > > > Which Flink versions does the operator plan to support?
> > > > > > > 1. Native K8s deployment was first introduced in Flink 1.10
> > > > > > > 2. Native K8s HA was introduced in Flink 1.12
> > > > > > > 3. The Pod template support was introduced in Flink 1.13
> > > > > > > 4. There were some changes to the Flink docker image entrypoint
> > > > > > > script in, IIRC, Flink 1.13
> > > > > > >
> > > > > >
> > > > > > Great, thanks for providing this. It is important for the
> > > > > > compatibility going forward also. We are targeting Flink 1.14.x
> > > > > > upwards. Before the operator is ready there will be another Flink
> > > > > > release. Let's see if anyone is interested in earlier versions?
> > > > > >
> > > > > > >
> > > > > > > ## Compatibility
> > > > > > > What kind of API compatibility can we commit to? It's probably
> > > > > > > fine to have alpha / beta version APIs that allow incompatible
> > > > > > > future changes for the first version. But eventually we would
> > > > > > > need to guarantee backwards compatibility, so that an early
> > > > > > > version CR can work with a new version operator.
> > > > > > >
> > > > > >
> > > > > > Another great point and please let me include that on the FLIP
> > > > > > page. ;-)
> > > > > >
> > > > > > I think we should allow incompatible changes for the first one or
> > > > > > two versions, similar to how other major features have evolved
> > > > > > recently, such as FLIP-27.
> > > > > >
> > > > > > Would be great to get broader feedback on this one.
> > > > > >
> > > > > > Cheers,
> > > > > > Thomas
> > > > > >
> > > > > > >
> > > > > > > Thank you~
> > > > > > >
> > > > > > > Xintong Song
> > > > > > >
> > > > > > > On Fri, Jan 28, 2022 at 1:18 PM Thomas Weise <t...@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks for the feedback!
> > > > > > > >
> > > > > > > > > # 1 Flink Native vs Standalone integration
> > > > > > > > > Maybe we should make this more clear in the FLIP but we
> > > > > > > > > agreed to do the first version of the operator based on the
> > > > > > > > > native integration.
> > > > > > > > > While this clearly does not cover all use-cases and
> > > > > > > > > requirements, it seems this would lead to a much smaller
> > > > > > > > > initial effort and a nicer first version.
> > > > > > > > >
> > > > > > > >
> > > > > > > > I'm also leaning towards the native integration, as long as
> > > > > > > > it reduces the MVP effort. Ultimately the operator will need
> > > > > > > > to also support the standalone mode. I would like to gain
> > > > > > > > more confidence that native integration reduces the effort.
> > > > > > > > While it cuts the effort to handle the TM pod creation, some
> > > > > > > > mapping code from the CR to the native integration client and
> > > > > > > > config needs to be created. As mentioned in the FLIP, native
> > > > > > > > integration requires the Flink job manager to have access to
> > > > > > > > the k8s API to create pods, which in some scenarios may be
> > > > > > > > seen as unfavorable.
> > > > > > > >
> > > > > > > > > # Pod Template
> > > > > > > > > > > Is the pod template in the CR the same as what Flink
> > > > > > > > > > > already supports [4]?
> > > > > > > > > > > Then I am afraid that not every arbitrary field (e.g.
> > > > > > > > > > > cpu/memory resources) could take effect.
> > > > > > > >
> > > > > > > > Yes, pod template would look almost identical. There are a
> > > > > > > > few settings that the operator will control (and that may
> > > > > > > > need to be blacklisted), but in general we would not want to
> > > > > > > > place restrictions. I think a mechanism where a pod template
> > > > > > > > is merged from multiple layers would also be interesting to
> > > > > > > > make this more flexible.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Thomas
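
For reference, the idea of merging a pod template from multiple layers, mentioned at the end of the quoted thread, could look roughly like the sketch below. It is written in Python purely for brevity and is not taken from the FLIP or from Flink. The function names (merge_layers, _merge_into), the three example layers, and the merge rule (nested mappings merge recursively, anything else is replaced by the later layer) are all assumptions for illustration.

```python
from copy import deepcopy


def _merge_into(target, source):
    """Recursively merge `source` into `target` in place (hypothetical helper)."""
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            # Both sides are mappings: descend and merge key by key.
            _merge_into(target[key], value)
        else:
            # Scalars, lists, and new keys: the later layer wins.
            # deepcopy keeps the input layers untouched.
            target[key] = deepcopy(value)


def merge_layers(*layers):
    """Merge pod template layers from left to right; later layers override."""
    merged = {}
    for layer in layers:
        _merge_into(merged, layer)
    return merged


# Hypothetical layers: cluster-wide defaults, the user's template from the CR,
# and settings the operator insists on controlling (applied last).
cluster_defaults = {
    "metadata": {"labels": {"team": "data"}},
    "spec": {"serviceAccountName": "flink"},
}
user_template = {
    "metadata": {"labels": {"app": "my-job"}},
    "spec": {"nodeSelector": {"disk": "ssd"}},
}
operator_overrides = {"spec": {"serviceAccountName": "flink-operator"}}

pod = merge_layers(cluster_defaults, user_template, operator_overrides)
```

Putting the operator-controlled settings in the last layer is one possible way to enforce the fields the operator needs to own without otherwise restricting the user template. Lists such as containers are replaced wholesale in this sketch; a real implementation would likely need to merge them by name.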