Hi, Thanks for your comments, Gyula, I really appreciate it!
I have updated the following things in the FLIP, please comment on these changes if you have any suggestions or concerns: - Added path field to FlinkStateSnapshotReference - Added two examples at the bottom. - Added error handling section and the new fields associated ("backoffLimit" and "failures") to the interfaces. - Renamed field "completed" to "alreadyExists". Regarding the separate resources, I don't think that any of the two solutions would bring too much (dis)advantage to the table, so I am still neutral, and waiting for others to chime in as well with their thoughts and feedback! Regards, Mate Gyula Fóra <gyula.f...@gmail.com> ezt írta (időpont: 2024. ápr. 19., P, 21:43): > Hey! > > Regarding the question of initialSavepointPath and > flinkStateSnapshotReference new object, I think we could simply keep an > extra field as part of the flinkStateSnapshotReference object called path. > > Then the fields could be: > namespace, name, path > > If path is defined we would use that (to support the simple way also) > otherwise use the resource. I would still deprecate the > initialSavepointPath field in the jobSpec. > > Regarding the Savepoint/Checkpoint vs FlinkStateSnapshot. > What we need: > 1. Easy way to list all state snapshots (to select latest) > 2. Easy way to reference a savepoint/checkpoint from a jobspec > 3. Differentiate state snapshot types (in some cases users may prefer to > use checkpoint/savepoint for certain upgrades) -> we should add a label or > something for easy selection > 4. Be able to delete savepoints (and checkpoints maybe) > > I am personally slightly more in favor of having a single resource as that > ticks all the boxes, while having 2 separate resources will make both > listing and referencing harder. We would have to introduce state type in > the reference (name + namespace would not be enough to uniquely identify a > state snapshot) > > I wonder if I am missing any good argument against the single > FlinkStateSnapshot here. > > Cheers, > Gyula > > > On Fri, Apr 19, 2024 at 9:09 PM Mate Czagany <czmat...@gmail.com> wrote: > >> Hi Robert and Thomas, >> >> Thank you for sharing your thoughts, I will try to address your questions >> and suggestions: >> >> 1. I would really love to hear others' inputs as well about separating >> the snapshot CRD into two different CRDs instead for savepoints and >> checkpoints. I think the main upside is that we would not need the >> mandatory savepoint or checkpoint field inside the spec. The two CRs could >> share the same status fields, and their specs would be different. >> I personally like both solutions, and would love to hear others' thoughts >> as well. >> >> 2. I agree with you that "completed" is not very clear, but I would >> suggest the name "alreadyExists". WDYT? >> >> 3. I think having a retry loop inside the operator does not add too much >> complexity to the FLIP. On failure, we check if we have reached the max >> retries. If we did, the state will be set to "FAILED", else it will be set >> to "TRIGGER_PENDING", causing the operator to retry the task. The "error" >> field will always be populated with the latest error. Kubernetes Jobs >> already has a similar field called "backoffLimit", maybe we could use the >> same name, with the same logic applied, WDYT? >> About error events, I think we should keep the "error" field, and upon >> successful snapshot, we clear it. I will add to the FLIP that there will be >> an event generated for each unsuccessful snapshots. >> >> 4. I really like the idea of having something like Pod Conditions, but I >> think it wouldn't add too much value here, because the only 2 stages >> important to the user are "Triggered" and "Completed", and those timestamps >> will already be included in the status field. I think it would make more >> sense to implement this if there were more lifecycle stages. >> >> 5. There will be a new field in JobSpec called >> "flinkStateSnapshotReference" to reference a FlinkStateSnapshot to restore >> from. >> >> > How do you see potential effects on API server performance wrt. number >> of >> objects vs mutations? Is the proposal more or less neutral in that regard? >> >> While I am not an expert in Kubernetes internals, my understanding is >> that for the api-server, editing an existing resource or creating a new one >> is not different performance-wise, because the whole resource will always >> be written to etcd anyways. >> Retrieving the savepoints from etcd will be different though for some >> use-cases, e.g. retrieving all snapshots for a specific FlinkDeployment >> would require the api-server to retrieve every snapshots first in a >> namespace from etcd, then filter them for that specific FlinkDeployment. I >> think this is a worst-case scenario, and it will be up to the user to >> optimize their queries via e.g. watch queries [1] or resourceVersions [2]. >> >> > Does that mean one would have to create a FlinkStateSnapshot CR when >> starting a new deployment from savepoint? If so, that's rather >> complicated. >> I would prefer something more simple/concise and would rather >> keep initialSavepointPath >> >> Starting a job from a savepoint path will indeed be deprecated with this >> FLIP. I agree that it will be more complicated to restore from a savepoint >> in those cases, but if the user decides to move away from the deprecated >> savepoint mechanisms, every savepoint will result in a new >> FlinkStateSnapshot CR. So the only situation I expect this to be an >> inconvenience is when the user onboards a new Flink job to the operator. >> But I may not be thinking this through, so please let me know if you >> disagree. >> >> Thank you very much for your questions and suggestions! >> >> [1] >> https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes >> [2] >> https://kubernetes.io/docs/reference/using-api/api-concepts/#resource-versions >> >> Regards, >> Mate >> >> Thomas Weise <t...@apache.org> ezt írta (időpont: 2024. ápr. 19., P, >> 11:31): >> >>> Thanks for the proposal. >>> >>> How do you see potential effects on API server performance wrt. number of >>> objects vs mutations? Is the proposal more or less neutral in that >>> regard? >>> >>> Thanks for the thorough feedback Robert. >>> >>> Couple more questions below. >>> >>> --> >>> >>> On Fri, Apr 19, 2024 at 5:07 AM Robert Metzger <rmetz...@apache.org> >>> wrote: >>> >>> > Hi Mate, >>> > thanks for proposing this, I'm really excited about your FLIP. I hope >>> my >>> > questions make sense to you: >>> > >>> > 1. I would like to discuss the "FlinkStateSnapshot" name and the fact >>> that >>> > users have to use either the savepoint or checkpoint spec inside the >>> > FlinkStateSnapshot. >>> > Wouldn't it be more intuitive to introduce two CRs: >>> > FlinkSavepoint and FlinkCheckpoint >>> > Ideally they can internally share a lot of code paths, but from a users >>> > perspective, the abstraction is much clearer. >>> > >>> >>> There are probably pros and cons either way. For example it is desirable >>> to >>> have a single list of state snapshots when looking for the initial >>> savepoint for a new deployment etc. >>> >>> >>> > >>> > 2. I also would like to discuss SavepointSpec.completed, as this name >>> is >>> > not intuitive to me. How about "ignoreExisting"? >>> > >>> > 3. The FLIP proposal seems to leave error handling to the user, e.g. >>> when >>> > you create a FlinkStateSnapshot, it will just move to status FAILED. >>> > Typically in K8s with the control loop stuff, resources are tried to >>> get >>> > created until success. I think it would be really nice if the >>> > FlinkStateSnapshot or FlinkSavepoint resource would retry based on a >>> > property in the resource. A "FlinkStateSnapshot.retries" number would >>> > indicate how often the user wants the operator to retry creating a >>> > savepoint, "retries = -1" means retry forever. In addition, we could >>> > consider a timeout as well, however, I haven't seen such a concept in >>> K8s >>> > CRs yet. >>> > The benefit of this is that other tools relying on the K8s operator >>> > wouldn't have to implement this retry loop (which is quite natural for >>> > K8s), they would just have to wait for the CR they've created to >>> transition >>> > into COMPLETED: >>> > >>> > 3. FlinkStateSnapshotStatus.error will only show the last error. What >>> > about using Events, so that we can show multiple errors and use the >>> > FlinkStateSnapshotState to report errors? >>> > >>> > 4. I wonder if it makes sense to use something like Pod Conditions >>> (e.g. >>> > Savepoint Conditions): >>> > >>> https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions >>> > to track the completion status. We could have the following conditions: >>> > - Triggered >>> > - Completed >>> > - Failed >>> > The only benefit of this proposal that I see is that it would tell the >>> > user how long it took to create the savepoint. >>> > >>> > 5. You mention that "JobSpec.initialSavepointPath" will be deprecated. >>> I >>> > assume we will introduce a new field for referencing a >>> FlinkStateSnapshot >>> > CR? I think it would be good to cover this in the FLIP. >>> > >>> > Does that mean one would have to create a FlinkStateSnapshot CR when >>> starting a new deployment from savepoint? If so, that's rather >>> complicated. >>> I would prefer something more simple/concise and would rather >>> keep initialSavepointPath >>> >>> >>> > >>> > One minor comment: >>> > >>> > "/** Dispose the savepoints upon CRD deletion. */" >>> > >>> > I think this should be "upon CR deletion", not "CRD deletion". >>> > >>> > Thanks again for this great FLIP! >>> > >>> > Best, >>> > Robert >>> > >>> > >>> > On Fri, Apr 19, 2024 at 9:01 AM Gyula Fóra <gyula.f...@gmail.com> >>> wrote: >>> > >>> >> Cc'ing some folks who gave positive feedback on this idea in the past. >>> >> >>> >> I would love to hear your thoughts on the proposed design >>> >> >>> >> Gyula >>> >> >>> >> On Tue, Apr 16, 2024 at 6:31 PM Őrhidi Mátyás < >>> matyas.orh...@gmail.com> >>> >> wrote: >>> >> >>> >>> +1 Looking forward to it >>> >>> >>> >>> On Tue, Apr 16, 2024 at 8:56 AM Mate Czagany <czmat...@gmail.com> >>> wrote: >>> >>> >>> >>> > Thank you Gyula! >>> >>> > >>> >>> > I think that is a great idea. I have updated the Google doc to only >>> >>> have 1 >>> >>> > new configuration option of boolean type, which can be used to >>> signal >>> >>> the >>> >>> > Operator to use the old mode. >>> >>> > >>> >>> > Also described in the configuration description, the Operator will >>> >>> fallback >>> >>> > to the old mode if the FlinkStateSnapshot CRD cannot be found on >>> the >>> >>> > Kubernetes cluster. >>> >>> > >>> >>> > Regards, >>> >>> > Mate >>> >>> > >>> >>> > Gyula Fóra <gyula.f...@gmail.com> ezt írta (időpont: 2024. ápr. >>> 16., >>> >>> K, >>> >>> > 17:01): >>> >>> > >>> >>> > > Thanks Mate, this is great stuff. >>> >>> > > >>> >>> > > Mate, I think the new configs should probably default to the new >>> >>> mode and >>> >>> > > they should only be useful for users to fall back to the old >>> >>> behaviour. >>> >>> > > We could by default use the new Snapshot CRD if the CRD is >>> installed, >>> >>> > > otherwise use the old mode by default and log a warning on >>> startup. >>> >>> > > >>> >>> > > So I am suggesting a "dynamic" default behaviour based on whether >>> >>> the new >>> >>> > > CRD was installed or not because we don't want to break operator >>> >>> startup. >>> >>> > > >>> >>> > > Gyula >>> >>> > > >>> >>> > > On Tue, Apr 16, 2024 at 4:48 PM Mate Czagany <czmat...@gmail.com >>> > >>> >>> wrote: >>> >>> > > >>> >>> > > > Hi Ferenc, >>> >>> > > > >>> >>> > > > Thank you for your comments, I have updated the Google docs >>> with a >>> >>> new >>> >>> > > > section for the new configs. >>> >>> > > > All of the newly added config keys will have defaults set, and >>> by >>> >>> > default >>> >>> > > > all the savepoint/checkpoint operations will use the old >>> system: >>> >>> write >>> >>> > > > their results to the FlinkDeployment/FlinkSessionJob status >>> field. >>> >>> > > > >>> >>> > > > I have also added a default for the checkpoint type to be FULL >>> >>> (which >>> >>> > is >>> >>> > > > also the default currently). That was an oversight on my part >>> to >>> >>> miss >>> >>> > > that. >>> >>> > > > >>> >>> > > > Regards, >>> >>> > > > Mate >>> >>> > > > >>> >>> > > > Ferenc Csaky <ferenc.cs...@pm.me.invalid> ezt írta (időpont: >>> 2024. >>> >>> > ápr. >>> >>> > > > 16., K, 16:10): >>> >>> > > > >>> >>> > > > > Thank you Mate for initiating this discussion. +1 for this >>> idea. >>> >>> > > > > Some Qs: >>> >>> > > > > >>> >>> > > > > Can you specify the newly introduced configurations in more >>> >>> > > > > details? Currently, it is not fully clear to me what are the >>> >>> > > > > possible values of >>> `kubernetes.operator.periodic.savepoint.mode`, >>> >>> > > > > is it optional, has a default value? >>> >>> > > > > >>> >>> > > > > I see that in `SavepointSpec.formatType` has a default, >>> although >>> >>> > > > > `CheckppointSpec.checkpointType` not. Are we inferring that >>> from >>> >>> > > > > the config? My point is, in general I think it would be good >>> to >>> >>> > > > > handle the two snapshot types in a similar way when it makes >>> >>> sense >>> >>> > > > > to minimize any kind of confusion. >>> >>> > > > > >>> >>> > > > > Best, >>> >>> > > > > Ferenc >>> >>> > > > > >>> >>> > > > > >>> >>> > > > > >>> >>> > > > > On Tuesday, April 16th, 2024 at 11:34, Mate Czagany < >>> >>> > > czmat...@gmail.com> >>> >>> > > > > wrote: >>> >>> > > > > >>> >>> > > > > > >>> >>> > > > > > >>> >>> > > > > > Hi Everyone, >>> >>> > > > > > >>> >>> > > > > > I would like to start a discussion on FLIP-446: Kubernetes >>> >>> Operator >>> >>> > > > State >>> >>> > > > > > Snapshot CRD. >>> >>> > > > > > >>> >>> > > > > > This FLIP adds a new custom resource for Operator users to >>> >>> create >>> >>> > and >>> >>> > > > > > manage their savepoints and checkpoints. I have also >>> developed >>> >>> an >>> >>> > > > initial >>> >>> > > > > > POC to prove that this approach is feasible, you can find >>> the >>> >>> link >>> >>> > > for >>> >>> > > > > that >>> >>> > > > > > in the FLIP. >>> >>> > > > > > >>> >>> > > > > > There is a Confluence page [1] and a Google Docs page [2] >>> as I >>> >>> do >>> >>> > not >>> >>> > > > > have >>> >>> > > > > > a Confluence account yet. >>> >>> > > > > > >>> >>> > > > > > [1] >>> >>> > > > > > >>> >>> > > > > >>> >>> > > > >>> >>> > > >>> >>> > >>> >>> >>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-446%3A+Kubernetes+Operator+State+Snapshot+CRD >>> >>> > > > > > [2] >>> >>> > > > > > >>> >>> > > > > >>> >>> > > > >>> >>> > > >>> >>> > >>> >>> >>> https://docs.google.com/document/d/1VdfLFaE4i6ESbCQ38CH7TKOiPQVvXeOxNV2FeSMnOTg >>> >>> > > > > > >>> >>> > > > > > >>> >>> > > > > > Regards, >>> >>> > > > > > Mate >>> >>> > > > > >>> >>> > > > >>> >>> > > >>> >>> > >>> >>> >>> >> >>> >>