Re: [DISCUSS] FLIP-446: Kubernetes Operator State Snapshot CRD

Mate Czagany Sat, 20 Apr 2024 07:31:25 -0700

Hi,

Thanks for your comments, Gyula, I really appreciate it!


I have updated the following things in the FLIP, please comment on these
changes if you have any suggestions or concerns:
- Added path field to FlinkStateSnapshotReference
- Added two examples at the bottom.
- Added error handling section and the new fields associated
("backoffLimit" and "failures") to the interfaces.
- Renamed field "completed" to "alreadyExists".

Regarding the separate resources, I don't think that any of the two
solutions would bring too much (dis)advantage to the table, so I am still
neutral, and waiting for others to chime in as well with their thoughts and
feedback!

Regards,
Mate


Gyula Fóra <[email protected]> ezt írta (időpont: 2024. ápr. 19., P,
21:43):

> Hey!
>
> Regarding the question of initialSavepointPath and
> flinkStateSnapshotReference new object, I think we could simply keep an
> extra field as part of the flinkStateSnapshotReference object called path.
>
> Then the fields could be:
> namespace, name, path
>
> If path is defined we would use that (to support the simple way also)
> otherwise use the resource. I would still deprecate the
> initialSavepointPath field in the jobSpec.
>
> Regarding the Savepoint/Checkpoint vs FlinkStateSnapshot.
> What we need:
>  1. Easy way to list all state snapshots (to select latest)
>  2. Easy way to reference a savepoint/checkpoint from a jobspec
>  3. Differentiate state snapshot types (in some cases users may prefer to
> use checkpoint/savepoint for certain upgrades) -> we should add a label or
> something for easy selection
>  4. Be able to delete savepoints (and checkpoints maybe)
>
> I am personally slightly more in favor of having a single resource as that
> ticks all the boxes, while having 2 separate resources will make both
> listing and referencing harder. We would have to introduce state type in
> the reference (name + namespace would not be enough to uniquely identify a
> state snapshot)
>
> I wonder if I am missing any good argument against the single
> FlinkStateSnapshot here.
>
> Cheers,
> Gyula
>
>
> On Fri, Apr 19, 2024 at 9:09 PM Mate Czagany <[email protected]> wrote:
>
>> Hi Robert and Thomas,
>>
>> Thank you for sharing your thoughts, I will try to address your questions
>> and suggestions:
>>
>> 1. I would really love to hear others' inputs as well about separating
>> the snapshot CRD into two different CRDs instead for savepoints and
>> checkpoints. I think the main upside is that we would not need the
>> mandatory savepoint or checkpoint field inside the spec. The two CRs could
>> share the same status fields, and their specs would be different.
>> I personally like both solutions, and would love to hear others' thoughts
>> as well.
>>
>> 2. I agree with you that "completed" is not very clear, but I would
>> suggest the name "alreadyExists". WDYT?
>>
>> 3. I think having a retry loop inside the operator does not add too much
>> complexity to the FLIP. On failure, we check if we have reached the max
>> retries. If we did, the state will be set to "FAILED", else it will be set
>> to "TRIGGER_PENDING", causing the operator to retry the task. The "error"
>> field will always be populated with the latest error. Kubernetes Jobs
>> already has a similar field called "backoffLimit", maybe we could use the
>> same name, with the same logic applied, WDYT?
>> About error events, I think we should keep the "error" field, and upon
>> successful snapshot, we clear it. I will add to the FLIP that there will be
>> an event generated for each unsuccessful snapshots.
>>
>> 4. I really like the idea of having something like Pod Conditions, but I
>> think it wouldn't add too much value here, because the only 2 stages
>> important to the user are "Triggered" and "Completed", and those timestamps
>> will already be included in the status field. I think it would make more
>> sense to implement this if there were more lifecycle stages.
>>
>> 5. There will be a new field in JobSpec called
>> "flinkStateSnapshotReference" to reference a FlinkStateSnapshot to restore
>> from.
>>
>> > How do you see potential effects on API server performance wrt. number
>> of
>> objects vs mutations? Is the proposal more or less neutral in that regard?
>>
>> While I am not an expert in Kubernetes internals, my understanding is
>> that for the api-server, editing an existing resource or creating a new one
>> is not different performance-wise, because the whole resource will always
>> be written to etcd anyways.
>> Retrieving the savepoints from etcd will be different though for some
>> use-cases, e.g. retrieving all snapshots for a specific FlinkDeployment
>> would require the api-server to retrieve every snapshots first in a
>> namespace from etcd, then filter them for that specific FlinkDeployment. I
>> think this is a worst-case scenario, and it will be up to the user to
>> optimize their queries via e.g. watch queries [1] or resourceVersions [2].
>>
>> > Does that mean one would have to create a FlinkStateSnapshot CR when
>> starting a new deployment from savepoint? If so, that's rather
>> complicated.
>> I would prefer something more simple/concise and would rather
>> keep initialSavepointPath
>>
>> Starting a job from a savepoint path will indeed be deprecated with this
>> FLIP. I agree that it will be more complicated to restore from a savepoint
>> in those cases, but if the user decides to move away from the deprecated
>> savepoint mechanisms, every savepoint will result in a new
>> FlinkStateSnapshot CR. So the only situation I expect this to be an
>> inconvenience is when the user onboards a new Flink job to the operator.
>> But I may not be thinking this through, so please let me know if you
>> disagree.
>>
>> Thank you very much for your questions and suggestions!
>>
>> [1]
>> https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes
>> [2]
>> https://kubernetes.io/docs/reference/using-api/api-concepts/#resource-versions
>>
>> Regards,
>> Mate
>>
>> Thomas Weise <[email protected]> ezt írta (időpont: 2024. ápr. 19., P,
>> 11:31):
>>
>>> Thanks for the proposal.
>>>
>>> How do you see potential effects on API server performance wrt. number of
>>> objects vs mutations? Is the proposal more or less neutral in that
>>> regard?
>>>
>>> Thanks for the thorough feedback Robert.
>>>
>>> Couple more questions below.
>>>
>>> -->
>>>
>>> On Fri, Apr 19, 2024 at 5:07 AM Robert Metzger <[email protected]>
>>> wrote:
>>>
>>> > Hi Mate,
>>> > thanks for proposing this, I'm really excited about your FLIP. I hope
>>> my
>>> > questions make sense to you:
>>> >
>>> > 1. I would like to discuss the "FlinkStateSnapshot" name and the fact
>>> that
>>> > users have to use either the savepoint or checkpoint spec inside the
>>> > FlinkStateSnapshot.
>>> > Wouldn't it be more intuitive to introduce two CRs:
>>> > FlinkSavepoint and FlinkCheckpoint
>>> > Ideally they can internally share a lot of code paths, but from a users
>>> > perspective, the abstraction is much clearer.
>>> >
>>>
>>> There are probably pros and cons either way. For example it is desirable
>>> to
>>> have a single list of state snapshots when looking for the initial
>>> savepoint for a new deployment etc.
>>>
>>>
>>> >
>>> > 2. I also would like to discuss SavepointSpec.completed, as this name
>>> is
>>> > not intuitive to me. How about "ignoreExisting"?
>>> >
>>> > 3. The FLIP proposal seems to leave error handling to the user, e.g.
>>> when
>>> > you create a FlinkStateSnapshot, it will just move to status FAILED.
>>> > Typically in K8s with the control loop stuff, resources are tried to
>>> get
>>> > created until success. I think it would be really nice if the
>>> > FlinkStateSnapshot or FlinkSavepoint resource would retry based on a
>>> > property in the resource. A "FlinkStateSnapshot.retries" number would
>>> > indicate how often the user wants the operator to retry creating a
>>> > savepoint, "retries = -1" means retry forever. In addition, we could
>>> > consider a timeout as well, however, I haven't seen such a concept in
>>> K8s
>>> > CRs yet.
>>> > The benefit of this is that other tools relying on the K8s operator
>>> > wouldn't have to implement this retry loop (which is quite natural for
>>> > K8s), they would just have to wait for the CR they've created to
>>> transition
>>> > into COMPLETED:
>>> >
>>> > 3. FlinkStateSnapshotStatus.error will only show the last error. What
>>> > about using Events, so that we can show multiple errors and use the
>>> > FlinkStateSnapshotState to report errors?
>>> >
>>> > 4. I wonder if it makes sense to use something like Pod Conditions
>>> (e.g.
>>> > Savepoint Conditions):
>>> >
>>> https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions
>>> > to track the completion status. We could have the following conditions:
>>> > - Triggered
>>> > - Completed
>>> > - Failed
>>> > The only benefit of this proposal that I see is that it would tell the
>>> > user how long it took to create the savepoint.
>>> >
>>> > 5. You mention that "JobSpec.initialSavepointPath" will be deprecated.
>>> I
>>> > assume we will introduce a new field for referencing a
>>> FlinkStateSnapshot
>>> > CR? I think it would be good to cover this in the FLIP.
>>> >
>>> > Does that mean one would have to create a FlinkStateSnapshot CR when
>>> starting a new deployment from savepoint? If so, that's rather
>>> complicated.
>>> I would prefer something more simple/concise and would rather
>>> keep initialSavepointPath
>>>
>>>
>>> >
>>> > One minor comment:
>>> >
>>> > "/** Dispose the savepoints upon CRD deletion. */"
>>> >
>>> > I think this should be "upon CR deletion", not "CRD deletion".
>>> >
>>> > Thanks again for this great FLIP!
>>> >
>>> > Best,
>>> > Robert
>>> >
>>> >
>>> > On Fri, Apr 19, 2024 at 9:01 AM Gyula Fóra <[email protected]>
>>> wrote:
>>> >
>>> >> Cc'ing some folks who gave positive feedback on this idea in the past.
>>> >>
>>> >> I would love to hear your thoughts on the proposed design
>>> >>
>>> >> Gyula
>>> >>
>>> >> On Tue, Apr 16, 2024 at 6:31 PM Őrhidi Mátyás <
>>> [email protected]>
>>> >> wrote:
>>> >>
>>> >>> +1 Looking forward to it
>>> >>>
>>> >>> On Tue, Apr 16, 2024 at 8:56 AM Mate Czagany <[email protected]>
>>> wrote:
>>> >>>
>>> >>> > Thank you Gyula!
>>> >>> >
>>> >>> > I think that is a great idea. I have updated the Google doc to only
>>> >>> have 1
>>> >>> > new configuration option of boolean type, which can be used to
>>> signal
>>> >>> the
>>> >>> > Operator to use the old mode.
>>> >>> >
>>> >>> > Also described in the configuration description, the Operator will
>>> >>> fallback
>>> >>> > to the old mode if the FlinkStateSnapshot CRD cannot be found on
>>> the
>>> >>> > Kubernetes cluster.
>>> >>> >
>>> >>> > Regards,
>>> >>> > Mate
>>> >>> >
>>> >>> > Gyula Fóra <[email protected]> ezt írta (időpont: 2024. ápr.
>>> 16.,
>>> >>> K,
>>> >>> > 17:01):
>>> >>> >
>>> >>> > > Thanks Mate, this is great stuff.
>>> >>> > >
>>> >>> > > Mate, I think the new configs should probably default to the new
>>> >>> mode and
>>> >>> > > they should only be useful for users to fall back to the old
>>> >>> behaviour.
>>> >>> > > We could by default use the new Snapshot CRD if the CRD is
>>> installed,
>>> >>> > > otherwise use the old mode by default and log a warning on
>>> startup.
>>> >>> > >
>>> >>> > > So I am suggesting a "dynamic" default behaviour based on whether
>>> >>> the new
>>> >>> > > CRD was installed or not because we don't want to break operator
>>> >>> startup.
>>> >>> > >
>>> >>> > > Gyula
>>> >>> > >
>>> >>> > > On Tue, Apr 16, 2024 at 4:48 PM Mate Czagany <[email protected]
>>> >
>>> >>> wrote:
>>> >>> > >
>>> >>> > > > Hi Ferenc,
>>> >>> > > >
>>> >>> > > > Thank you for your comments, I have updated the Google docs
>>> with a
>>> >>> new
>>> >>> > > > section for the new configs.
>>> >>> > > > All of the newly added config keys will have defaults set, and
>>> by
>>> >>> > default
>>> >>> > > > all the savepoint/checkpoint operations will use the old
>>> system:
>>> >>> write
>>> >>> > > > their results to the FlinkDeployment/FlinkSessionJob status
>>> field.
>>> >>> > > >
>>> >>> > > > I have also added a default for the checkpoint type to be FULL
>>> >>> (which
>>> >>> > is
>>> >>> > > > also the default currently). That was an oversight on my part
>>> to
>>> >>> miss
>>> >>> > > that.
>>> >>> > > >
>>> >>> > > > Regards,
>>> >>> > > > Mate
>>> >>> > > >
>>> >>> > > > Ferenc Csaky <[email protected]> ezt írta (időpont:
>>> 2024.
>>> >>> > ápr.
>>> >>> > > > 16., K, 16:10):
>>> >>> > > >
>>> >>> > > > > Thank you Mate for initiating this discussion. +1 for this
>>> idea.
>>> >>> > > > > Some Qs:
>>> >>> > > > >
>>> >>> > > > > Can you specify the newly introduced configurations in more
>>> >>> > > > > details? Currently, it is not fully clear to me what are the
>>> >>> > > > > possible values of
>>> `kubernetes.operator.periodic.savepoint.mode`,
>>> >>> > > > > is it optional, has a default value?
>>> >>> > > > >
>>> >>> > > > > I see that in `SavepointSpec.formatType` has a default,
>>> although
>>> >>> > > > > `CheckppointSpec.checkpointType` not. Are we inferring that
>>> from
>>> >>> > > > > the config? My point is, in general I think it would be good
>>> to
>>> >>> > > > > handle the two snapshot types in a similar way when it makes
>>> >>> sense
>>> >>> > > > > to minimize any kind of confusion.
>>> >>> > > > >
>>> >>> > > > > Best,
>>> >>> > > > > Ferenc
>>> >>> > > > >
>>> >>> > > > >
>>> >>> > > > >
>>> >>> > > > > On Tuesday, April 16th, 2024 at 11:34, Mate Czagany <
>>> >>> > > [email protected]>
>>> >>> > > > > wrote:
>>> >>> > > > >
>>> >>> > > > > >
>>> >>> > > > > >
>>> >>> > > > > > Hi Everyone,
>>> >>> > > > > >
>>> >>> > > > > > I would like to start a discussion on FLIP-446: Kubernetes
>>> >>> Operator
>>> >>> > > > State
>>> >>> > > > > > Snapshot CRD.
>>> >>> > > > > >
>>> >>> > > > > > This FLIP adds a new custom resource for Operator users to
>>> >>> create
>>> >>> > and
>>> >>> > > > > > manage their savepoints and checkpoints. I have also
>>> developed
>>> >>> an
>>> >>> > > > initial
>>> >>> > > > > > POC to prove that this approach is feasible, you can find
>>> the
>>> >>> link
>>> >>> > > for
>>> >>> > > > > that
>>> >>> > > > > > in the FLIP.
>>> >>> > > > > >
>>> >>> > > > > > There is a Confluence page [1] and a Google Docs page [2]
>>> as I
>>> >>> do
>>> >>> > not
>>> >>> > > > > have
>>> >>> > > > > > a Confluence account yet.
>>> >>> > > > > >
>>> >>> > > > > > [1]
>>> >>> > > > > >
>>> >>> > > > >
>>> >>> > > >
>>> >>> > >
>>> >>> >
>>> >>>
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-446%3A+Kubernetes+Operator+State+Snapshot+CRD
>>> >>> > > > > > [2]
>>> >>> > > > > >
>>> >>> > > > >
>>> >>> > > >
>>> >>> > >
>>> >>> >
>>> >>>
>>> https://docs.google.com/document/d/1VdfLFaE4i6ESbCQ38CH7TKOiPQVvXeOxNV2FeSMnOTg
>>> >>> > > > > >
>>> >>> > > > > >
>>> >>> > > > > > Regards,
>>> >>> > > > > > Mate
>>> >>> > > > >
>>> >>> > > >
>>> >>> > >
>>> >>> >
>>> >>>
>>> >>
>>>
>>

Re: [DISCUSS] FLIP-446: Kubernetes Operator State Snapshot CRD

Reply via email to