Hi Zakelly,

> I'd suggest we could think of this as a whole

In general I think we have the same idea in mind: state observability
should be considered as a whole. We just need to agree on how the actual
tasks are scheduled.

> But such a solution requires more design and discussion

I couldn't agree more. But that doesn't mean we should create a single
giant FLIP after several months of discussion.

> Regarding the current issue you are facing, here's my idea

Just to make it crystal clear, I'm not shooting for a tiny ad-hoc fix. I've
started down a path where we fill each and every gap, ending up with a
solution that meets the functionality and UX bar set by the Spark solution.
From a plan and code perspective I'm further ahead than this FLIP.
So if you are aiming for a different task scheduling, please make an exact
suggestion instead of providing hacks.

If I understand correctly, you suggest creating a single giant FLIP where we
define and agree on all the missing pieces, right?
I would say there are obvious missing pieces that are clearly needed. Just
like with PRs, the more consumable the pieces the better, and this single
change alone is already about 1k lines of code. An oversized FLIP/PR can end
up in feature creep, which I think is disadvantageous.

Of course this doesn't exclude starting a more general, high-level
discussion about the whole state observability story.
Here are my high-level conceptual points (I consider each point roughly a
separate FLIP):
* Store human-readable IDs for operators in metadata
* Expose the metadata as a data stream
* Store state with user-defined schemas as a self-contained entity
* SQL integration
* State metastore with all the created checkpoints/savepoints
* State file cleanup strategy in case of failure
* Optional: some extra tooling, such as a metadata explorer
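
To make the first point a bit more tangible, here is a rough sketch of the
idea (plain Python, not Flink code). Everything in it is an assumption for
illustration: `operator_hash` uses MD5 only as a stand-in for whatever hash
Flink actually derives from the UID, and `OperatorInfo` / `build_metadata`
are hypothetical names, not part of FLIP-474 or any Flink API. The only
point is that the hash cannot be turned back into a UID, so the readable
name and UID have to be stored next to it:

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


def operator_hash(uid: str) -> str:
    """Stand-in for the UID -> operator hash step (MD5 is an assumption)."""
    return hashlib.md5(uid.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class OperatorInfo:
    """Hypothetical human-readable metadata kept next to the hash."""
    name: Optional[str]
    uid: Optional[str]


def build_metadata(
    operators: Iterable[Tuple[str, str]]
) -> Dict[str, OperatorInfo]:
    """Map each operator hash to the name and UID it was derived from."""
    return {
        operator_hash(uid): OperatorInfo(name=name, uid=uid)
        for name, uid in operators
    }


# Hypothetical job with two operators given as (name, uid) pairs.
metadata = build_metadata([
    ("Source: orders", "orders-source"),
    ("Deduplicate", "dedup-op"),
])

# Someone who only sees a hash can now look up what it belongs to.
some_hash = operator_hash("dedup-op")
info = metadata[some_hash]
print(some_hash, info.name, info.uid)
```

Without the stored `OperatorInfo`, the only way to get from `some_hash` back
to "dedup-op" would be to already know (or guess) the UID and hash it again.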

That said, I suggest splitting the higher-level discussion from this FLIP
into a separate thread.

BR,
G


On Fri, Aug 9, 2024 at 10:17 AM Zakelly Lan <zakelly....@gmail.com> wrote:

> Hi Márton and Gabor,
>
> Thanks for sharing context!
>
> Yes, I'd admit that users need a more friendly way to explore states. And
> it seems Flink lacks something like the state metadata store. I'd suggest
> we could think of this as a whole, to store enough information for
> querying, including operator names, uids, hashes, as well as the state
> types or descriptors. Moreover we provide a tool to list those metadata. My
> thought is to provide a complete solution instead of adding one or two
> specific data alongside the checkpoint. WDYT? I believe with the state
> schema queryable, the State Processor API could become more powerful and
> easier to use.
>
> But such a solution requires more design and discussion. Regarding the
> current issue you are facing, here's my idea: If you could get access to
> the web UI, you can get the hash (vertex id) in the url by clicking and
> zooming in on the operator you want to query. IIUC, this hash can be used
> to query the state. Is this feasible? Additionally, I think we could add
> user-defined UIDs on the web UI and related REST APIs. Thus users could
> easily identify an operator by uid, or get the uid of an operator.
>
> Best,
> Zakelly
>
> On Thu, Aug 8, 2024 at 11:03 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
>
> > Hi Zakelly,
> >
> > Thanks for the feedback, let me elaborate on this.
> >
> > In short Databricks has created a much more user friendly solution[1] for
> > state observability (based on Flink's state processor API) than what we
> > have now.
> >
> > Up until now our state processor API was good enough but now we're lagging
> > behind. We see users (just like Spark) where the first class citizen is the
> > state itself and they're pointing to the new Spark solution. Since the state
> > became first class citizen there is a natural need to use it for business
> > logic validation, debugging, explanatory browsing, etc...
> >
> > The main message here is that there are cases where users are not able to
> > identify operators because hash is a one way conversion.
> > I'm open to any suggestion but somehow the initial operator human readable
> > identifier must be available. Let me come up with examples where
> > users are completely blind.
> >
> > > Are you saying the user can set the operator uid but then doesn't know
> > > what they set when debugging?
> >
> > There are cases where the user is setting the UID in the job, such case
> > it's not user friendly to parse git repos but doable.
> > But there are cases where the user has limited or no control related UIDs:
> > * SQL jobs are generating operators with meaningful names, but I think it's
> > not realistic to enforce users to understand all the internals of Flink SQL
> > implementation (which operator named where and how).
> > * Iceberg is using the given UID as prefix and generating more operators
> > with it
> > * Weak justification but exists: Since operator name and UID are both
> > optional some of the users are setting name only. Such case Flink generates
> > a random hash, where only name can give some pointers.
> >
> > Hope I've given better context.
> >
> > [1]
> > https://www.databricks.com/blog/announcing-state-reader-api-new-statestore-data-source
> >
> > BR,
> > G
> >
> >
> >
> > On Thu, Aug 8, 2024 at 12:06 PM Zakelly Lan <zakelly....@gmail.com> wrote:
> >
> > > Hi Gabor,
> > >
> > > Thanks for the proposal! However, I find it a little strange. Are you
> > > saying the user can set the operator uid but then doesn't know what they
> > > set when debugging? Otherwise, is the `OperatorIdentifier.forUid("my-uid")`
> > > feasible? I understand your point about potential cross-team work, but the
> > > person may not be able to debug code that was not written by them. Things
> > > get complex in this scenario. Could you provide more details about the
> > > issue you are facing?
> > >
> > > Regarding the checkpoint, it is not designed to be self-contained or
> > > human-readable. I suggest not introducing such columns for debugging
> > > purposes.
> > >
> > >
> > > Best,
> > > Zakelly
> > >
> > > On Wed, Aug 7, 2024 at 10:07 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
> > >
> > > > Hi Devs,
> > > >
> > > > I would like to start a discussion on FLIP-474: Store operator name and
> > > > UID in state metadata[1].
> > > >
> > > > In short users are interested in what kind of operators are inside a
> > > > checkpoint data which can be enhanced from user experience perspective.
> > > > The details can be found in FLIP-474[1].
> > > >
> > > > Please share your thoughts on this.
> > > >
> > > > [1]
> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-474%3A+Store+operator+name+and+UID+in+state+metadata
> > > >
> > > > BR,
> > > > G
> > > >
> > >
> >
>
