Re: [DISCUSS] FLIP-474: Store operator name and UID in state metadata

Zakelly Lan Mon, 26 Aug 2024 19:31:18 -0700

Hi Gabor,

Thanks for clarifying! It's important that we share the same vision.



Best,
Zakelly

On Mon, Aug 26, 2024 at 3:56 PM Gabor Somogyi <gabor.g.somo...@gmail.com>
wrote:

> Hi All,
>
> Thanks for the contributions!
> The requested umbrella document helped resolve the misunderstandings and
> align on a common result.
>
> As a result I think we're ready for the voting thread.
>
> BR,
> G
>
>
> On Mon, Aug 19, 2024 at 8:47 AM Gabor Somogyi <gabor.g.somo...@gmail.com>
> wrote:
>
> > Hi All,
> >
> > Based on our agreement I've created a draft Flink state observability
> > umbrella [1].
> > Please share your comments. It contains some details to give some insight,
> > but the focus is on the direction.
> >
> > [1]
> > https://docs.google.com/document/d/1Du1-TShoOjaNDCahs3sgLWIpYkXzJPdSkgHcWLpELyw/edit
> >
> > BR,
> > G
> >
> >
> > On Sat, Aug 10, 2024 at 10:54 AM Zakelly Lan <zakelly....@gmail.com>
> > wrote:
> >
> >> Hi Gabor,
> >>
> >> I apologize for any confusion. Let me clarify my position.
> >>
> >> The concept of state observability is important for users, and the current
> >> FLIP seems to be a step in the right direction. However, before we
> >> proceed, I suggest we discuss the final presentation of state
> >> observability to the user and consider the high-level vision for
> >> achieving this. It's essential to ensure that the current FLIP aligns
> >> with the overall objective. I'm not suggesting a comprehensive FLIP to
> >> address all the missing pieces, and one FLIP for each piece is fine with
> >> me. I just want to ensure that we are on the same page in terms of
> >> vision. The last thing I want is a fragmented approach resulting in
> >> refactoring or deprecation of code when we need a complete feature.
> >>
> >> Actually, I am hesitant about the current proposal of adding the uid *in*
> >> the state metadata. It may cause state incompatibility issues across
> >> versions. In theory we can do this, but it is better avoided if the data
> >> is added only for human readability rather than for fault tolerance. And
> >> it could get worse if we add one or two columns sporadically in the future.
> >>
> >> In fact, I expect the state metadata store to exist next to the checkpoint
> >> metadata, rather than within it. This gives us enough flexibility to
> >> polish this function as users need it, without breaking checkpoint
> >> compatibility too often. Moreover, we would not have to stick to the
> >> checkpoint format and could choose a more human-readable format, such as
> >> JSON, for the metadata store. This is where I think this FLIP is
> >> inconsistent with my expectation of the state observability approach.
> >> These considerations deserve a discussion before proceeding with other
> >> details. WDYT?
> >>
> >>
> >> Best,
> >> Zakelly
> >>
> >> On Fri, Aug 9, 2024 at 8:22 PM Gabor Somogyi <gabor.g.somo...@gmail.com
> >
> >> wrote:
> >>
> >> > Hi David,
> >> >
> >> > Thanks for sharing your thoughts!
> >> >
> >> > > It sounds like you might already have an end-to-end solution in
> mind.
> >> It
> >> > would be really helpful if you could put that into writing so we can
> all
> >> > align our thinking.
> >> >
> >> > It makes sense to create a high level vision.
> >> >
> >> > > I’m not a fan of the mindset of “this is how it was done in Spark,
> so
> >> > we’ll
> >> > just replicate it” without proper discussion. We’ve had similar
> >> > conversations before.
> >> >
> >> > I think we've already had this conversation in the case of the delegation
> >> > token framework, and I can say the same here. There is no intention to
> >> > copy things blindly, but there is no shame in being inspired by solutions
> >> > that users welcome. The intention is similar to the scalable
> >> > authentication area, where Flink is now ahead of Spark.
> >> >
> >> > > Would it be too much to ask for a FLIP that outlines the overall
> >> vision
> >> > (without delving too deeply into the details) to ensure everyone is
> >> aligned
> >> > and moving in the same direction?
> >> >
> >> > That's a fair point and a constructive way to proceed.
> >> > I'm going to come back with the details...
> >> >
> >> > BR,
> >> > G
> >> >
> >> >
> >> > On Fri, Aug 9, 2024 at 1:36 PM David Morávek <d...@apache.org> wrote:
> >> >
> >> > > Hi Gabor,
> >> > >
> >> > > Thanks for taking the initiative on this. It’s clear that significant
> >> > > improvements are needed in this area, and parsing state files can be
> >> > > incredibly challenging, even for those who are well-versed in it.
> >> > >
> >> > > > Just to make it crystal clear, I’m not shooting for an ad-hoc tiny
> >> fix
> >> > > but started a path where we fill each and every gap which will end
> up
> >> in
> >> > a
> >> > > functionality and UX bar just like the Spark solution.
> >> > >
> >> > > It sounds like you might already have an end-to-end solution in
> mind.
> >> It
> >> > > would be really helpful if you could put that into writing so we can
> >> all
> >> > > align our thinking.
> >> > >
> >> > > I’m not a fan of the mindset of “this is how it was done in Spark,
> so
> >> > we’ll
> >> > > just replicate it” without proper discussion. We’ve had similar
> >> > > conversations before.
> >> > >
> >> > > > But this doesn’t mean we create a single giga big FLIP after
> several
> >> > > months of discussion.
> >> > >
> >> > > I don’t think anyone is asking for a massive FLIP after lengthy
> >> > > discussions, but having a document that outlines the overall vision
> >> > > could be incredibly valuable, especially in a distributed setting. It
> >> > > also opens the door for others to contribute to and shape this shared
> >> > > vision, which is a core principle of community-driven open-source
> >> > > development.
> >> > >
> >> > > Would it be too much to ask for a FLIP that outlines the overall vision
> >> > > (without delving too deeply into the details) to ensure everyone is
> >> > > aligned and moving in the same direction?
> >> > >
> >> > > Best,
> >> > > D.
> >> > >
> >> > > On Fri, Aug 9, 2024 at 11:44 AM Gabor Somogyi <
> >> gabor.g.somo...@gmail.com
> >> > >
> >> > > wrote:
> >> > >
> >> > > > Hi Zakelly,
> >> > > >
> >> > > > > I'd suggest we could think of this as a whole
> >> > > >
> >> > > > In general I think we have the same idea about considering state
> >> > > > observability as a whole; we just need to agree on the physical
> >> > > > task scheduling.
> >> > > >
> >> > > > > But such a solution requires more design and discussion
> >> > > >
> >> > > > I couldn't agree more. But this doesn't mean we create a single
> >> > > > giga big FLIP after several months of discussion.
> >> > > >
> >> > > > > Regarding the current issue you are facing, here's my idea
> >> > > >
> >> > > > Just to make it crystal clear, I'm not shooting for an ad-hoc tiny fix
> >> > > > but started a path where we fill each and every gap, which will end up
> >> > > > in a solution where we hit the functionality and UX bar just like the
> >> > > > Spark solution.
> >> > > > From a planning and code perspective I'm further ahead than this FLIP.
> >> > > > So if you aim for a different task scheduling, then make your exact
> >> > > > suggestion instead of providing hacks.
> >> > > >
> >> > > > If I understand correctly, you suggest creating a FLIP where we define
> >> > > > and agree on all the missing pieces in a single giga big FLIP, right?
> >> > > > I would say there are obvious missing pieces that are clearly needed.
> >> > > > Just like in PRs, the more consumable the pieces the better, because
> >> > > > this single change is about 1k lines of code. Having an overkill
> >> > > > FLIP/PR can end up in feature creep, which I think is
> >> > > > disadvantageous.
> >> > > >
> >> > > > Of course this doesn't exclude the possibility of starting a more
> >> > > > general, high-level discussion about the whole state observability
> >> > > > story. Here are my high-level conceptual points (I consider roughly
> >> > > > each point a separate FLIP):
> >> > > > * Store human readable IDs for operators in metadata
> >> > > > * Expose the metadata as data stream
> >> > > > * Store state with user defined schemas as self containing entity
> >> > > > * SQL integration
> >> > > > * State metastore with all the created checkpoints/savepoints
> >> > > > * State file cleanup strategy in case of failure
> >> > > > * Optional: Some extra tool like metadata explorer
> >> > > >
> >> > > > That said, I suggest splitting the higher-level discussion from this
> >> > > > FLIP into a separate thread.
> >> > > >
> >> > > > BR,
> >> > > > G
> >> > > >
> >> > > >
> >> > > > On Fri, Aug 9, 2024 at 10:17 AM Zakelly Lan <
> zakelly....@gmail.com>
> >> > > wrote:
> >> > > >
> >> > > > > Hi Márton and Gabor,
> >> > > > >
> >> > > > > Thanks for sharing context!
> >> > > > >
> >> > > > > Yes, I'd admit that users need a friendlier way to explore state,
> >> > > > > and it seems Flink lacks something like a state metadata store.
> >> > > > > I'd suggest we could think of this as a whole, to store enough
> >> > > > > information for querying, including operator names, uids, hashes,
> >> > > > > as well as the state types or descriptors. Moreover, we could
> >> > > > > provide a tool to list this metadata. My thought is to provide a
> >> > > > > complete solution instead of adding one or two specific pieces of
> >> > > > > data alongside the checkpoint. WDYT? I believe that with the state
> >> > > > > schema queryable, the State Processor API could become more
> >> > > > > powerful and easier to use.
> >> > > > >
> >> > > > > But such a solution requires more design and discussion. Regarding
> >> > > > > the current issue you are facing, here's my idea: If you have
> >> > > > > access to the web UI, you can get the hash (vertex id) from the URL
> >> > > > > by clicking and zooming in on the operator you want to query. IIUC,
> >> > > > > this hash can be used to query the state. Is this feasible?
> >> > > > > Additionally, I think we could add user-defined UIDs to the web UI
> >> > > > > and the related REST APIs, so users could easily identify an
> >> > > > > operator by uid, or get the uid of an operator.
> >> > > > >
> >> > > > > Best,
> >> > > > > Zakelly
> >> > > > >
> >> > > > > On Thu, Aug 8, 2024 at 11:03 PM Gabor Somogyi <
> >> > > gabor.g.somo...@gmail.com
> >> > > > >
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Hi Zakelly,
> >> > > > > >
> >> > > > > > Thanks for the feedback, let me elaborate on this.
> >> > > > > >
> >> > > > > > In short, Databricks has created a much more user-friendly
> >> > > > > > solution[1] for state observability (based on Flink's state
> >> > > > > > processor API) than what we have now.
> >> > > > > >
> >> > > > > > Up until now our state processor API was good enough, but now
> >> > > > > > we're lagging behind. We see users (just like in Spark) for whom
> >> > > > > > the state itself is a first-class citizen, and they're pointing
> >> > > > > > to the new Spark solution. Since the state has become a
> >> > > > > > first-class citizen, there is a natural need to use it for
> >> > > > > > business logic validation, debugging, exploratory browsing, etc.
> >> > > > > >
> >> > > > > > The main message here is that there are cases where users are
> >> > > > > > not able to identify operators because the hash is a one-way
> >> > > > > > conversion. I'm open to any suggestion, but somehow the original
> >> > > > > > human-readable operator identifier must be available. Let me
> >> > > > > > come up with examples where users are completely blind.
> >> > > > > >
> >> > > > > > > Are you saying the user can set the operator uid but then
> >> doesn't
> >> > > > know
> >> > > > > > what they set when debugging?
> >> > > > > >
> >> > > > > > There are cases where the user sets the UID in the job. In such a
> >> > > > > > case, digging through git repos is not user friendly, but doable.
> >> > > > > > But there are cases where the user has limited or no control over
> >> > > > > > UIDs:
> >> > > > > > * SQL jobs generate operators with meaningful names, but I think
> >> > > > > > it's not realistic to expect users to understand all the
> >> > > > > > internals of the Flink SQL implementation (which operator is
> >> > > > > > named where and how).
> >> > > > > > * Iceberg uses the given UID as a prefix and generates more
> >> > > > > > operators with it
> >> > > > > > * Weak justification, but it exists: since operator name and UID
> >> > > > > > are both optional, some users set the name only. In such a case
> >> > > > > > Flink generates a random hash, where only the name can give some
> >> > > > > > pointers.
> >> > > > > >
> >> > > > > > Hope I've given better context.
> >> > > > > >
> >> > > > > > [1]
> >> > > > > > https://www.databricks.com/blog/announcing-state-reader-api-new-statestore-data-source
> >> > > > > >
> >> > > > > > BR,
> >> > > > > > G
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > On Thu, Aug 8, 2024 at 12:06 PM Zakelly Lan <
> >> zakelly....@gmail.com
> >> > >
> >> > > > > wrote:
> >> > > > > >
> >> > > > > > > Hi Gabor,
> >> > > > > > >
> >> > > > > > > Thanks for the proposal! However, I find it a little strange.
> >> > > > > > > Are you saying the user can set the operator uid but then
> >> > > > > > > doesn't know what they set when debugging? Otherwise, is
> >> > > > > > > `OperatorIdentifier.forUid("my-uid")` feasible? I understand
> >> > > > > > > your point about potential cross-team work, but the person may
> >> > > > > > > not be able to debug code that was not written by them. Things
> >> > > > > > > get complex in this scenario. Could you provide more details
> >> > > > > > > about the issue you are facing?
> >> > > > > > >
> >> > > > > > > Regarding the checkpoint, it is not designed to be
> >> > > > > > > self-contained or human-readable. I suggest not introducing
> >> > > > > > > such columns for debugging purposes.
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > Best,
> >> > > > > > > Zakelly
> >> > > > > > >
> >> > > > > > > On Wed, Aug 7, 2024 at 10:07 PM Gabor Somogyi <
> >> > > > > gabor.g.somo...@gmail.com
> >> > > > > > >
> >> > > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > Hi Devs,
> >> > > > > > > >
> >> > > > > > > > I would like to start a discussion on FLIP-474: Store
> >> > > > > > > > operator name and UID in state metadata[1].
> >> > > > > > > >
> >> > > > > > > > In short, users are interested in what kind of operators are
> >> > > > > > > > inside checkpoint data, and the user experience here can be
> >> > > > > > > > improved. The details can be found in FLIP-474[1].
> >> > > > > > > >
> >> > > > > > > > Please share your thoughts on this.
> >> > > > > > > >
> >> > > > > > > > [1]
> >> > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-474%3A+Store+operator+name+and+UID+in+state+metadata
> >> > > > > > > >
> >> > > > > > > > BR,
> >> > > > > > > > G
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>
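
For readers following the thread, here is a minimal sketch of the point that an operator hash is a one-way conversion, and of the idea of keeping human-readable metadata next to the checkpoint rather than inside it. Everything in it is an assumption made for illustration: MD5 is only a stand-in and is not Flink's actual hash function, and the operator names, uids, and the JSON layout are hypothetical, not anything proposed in FLIP-474.

```python
import hashlib
import json


def operator_hash(uid: str) -> str:
    """Stand-in for the uid-to-hash step; NOT Flink's actual hash function."""
    return hashlib.md5(uid.encode("utf-8")).hexdigest()


def build_sidecar(operators):
    """Map each operator hash to its human-readable name and uid.

    `operators` is a list of (name, uid) pairs; the layout is hypothetical.
    """
    return {operator_hash(uid): {"name": name, "uid": uid} for name, uid in operators}


# Hypothetical job with two operators.
operators = [("Kafka Source", "orders-source"), ("Deduplicate", "orders-dedup")]
sidecar = build_sidecar(operators)

# Forward direction: anyone who knows the uid can recompute the hash.
known_hash = operator_hash("orders-dedup")

# Reverse direction: the hash alone does not reveal the uid, so the readable
# identifiers are looked up in the stored mapping instead.
print(sidecar[known_hash]["name"], sidecar[known_hash]["uid"])

# A human-readable rendering, e.g. a JSON file kept next to the checkpoint.
sidecar_json = json.dumps(sidecar, indent=2, sort_keys=True)
```

The sketch only shows why some stored mapping is needed when the uid is not known up front; whether that mapping lives inside the checkpoint metadata or beside it is exactly the design question debated in the thread.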
