Hi Gabor,

I apologize for any confusion. Let me clarify my position.

The concept of state observability is important for users, and the current FLIP seems to be a step in the right direction. However, before we proceed, I suggest we discuss the final presentation of the state observability to the user and consider the high-level vision for achieving this. It's essential to ensure that the current FLIP aligns with the overall objective. I'm not suggesting a comprehensive FLIP to address all the missing pieces, and one FLIP for each piece is fine with me. I just want to ensure that we are on the same page in terms of vision. The last thing I want is a fragmented approach that results in refactoring or deprecating code when we need a complete feature.

Actually, I am hesitant about the current proposal of adding the uid *in* the state metadata. It may cause state incompatibility issues across versions. In theory we can do this, but it is better not to if we are adding data not for fault tolerance but only for human readability. And it could get worse if we add one or two columns sporadically in the future. In fact, I expect the state metadata store to exist next to the checkpoint metadata, rather than within it. This gives us enough flexibility to polish this function as users need it, without breaking checkpoint compatibility too often. Moreover, we don't have to stick to the checkpoint format; we could choose a more human-readable format like JSON for the metadata store.

This is where I think this FLIP is inconsistent with my expectation of the state observability approach. These considerations deserve a discussion before proceeding with other details. WDYT?

Best,
Zakelly

On Fri, Aug 9, 2024 at 8:22 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:

> Hi David,
>
> Thanks for sharing your thoughts!
>
> > It sounds like you might already have an end-to-end solution in mind. It would be really helpful if you could put that into writing so we can all align our thinking.
>
> It makes sense to create a high level vision.
>
> > I’m not a fan of the mindset of “this is how it was done in Spark, so we’ll just replicate it” without proper discussion. We’ve had similar conversations before.
>
> I think we've had this conversation already in case of delegation token framework and I can say the same. No intention to take over things blindly but it's not a shame to be inspired by solutions which are welcome by users.
> The intention is similar just like in scalable authentication area where Flink is now ahead of Spark.
>
> > Would it be too much to ask for a FLIP that outlines the overall vision (without delving too deeply into the details) to ensure everyone is aligned and moving in the same direction?
>
> That's a fair point and a constructive way how we can proceed.
> I'm going to come back with the details...
>
> BR,
> G
>
>
> On Fri, Aug 9, 2024 at 1:36 PM David Morávek <d...@apache.org> wrote:
>
> > Hi Gabor,
> >
> > Thanks for taking the initiative on this. It’s clear that significant improvements are needed in this area, and parsing state files can be incredibly challenging, even for those who are well-versed in it.
> >
> > > Just to make it crystal clear, I’m not shooting for an ad-hoc tiny fix but started a path where we fill each and every gap which will end up in a functionality and UX bar just like the Spark solution.
> >
> > It sounds like you might already have an end-to-end solution in mind. It would be really helpful if you could put that into writing so we can all align our thinking.
> >
> > I’m not a fan of the mindset of “this is how it was done in Spark, so we’ll just replicate it” without proper discussion. We’ve had similar conversations before.
> >
> > > But this doesn’t mean we create a single giga big FLIP after several months of discussion.
> >
> > I don’t think anyone is asking for a massive FLIP after lengthy discussions, but having a document that outlines the overall vision could be incredibly valuable, especially in a distributed setting. It also opens the door for others to contribute to and shape this shared vision, which is a core principle of community-driven open-source development.
> >
> > Would it be too much to ask for a FLIP that outlines the overall vision (without delving too deeply into the details) to ensure everyone is aligned and moving in the same direction?
> >
> > Best,
> > D.
> >
> > On Fri, Aug 9, 2024 at 11:44 AM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
> >
> > > Hi Zakelly,
> > >
> > > > I'd suggest we could think of this as a whole
> > >
> > > In general I think we have the same idea in our mind about considering the state observability as a whole, just we need to agree about the physical task scheduling.
> > >
> > > > But such a solution requires more design and discussion
> > >
> > > I can't even agree more. But this doesn't mean we create a single giga big FLIP after several months of discussion.
> > >
> > > > Regarding the current issue you are facing, here's my idea
> > >
> > > Just to make it crystal clear, I'm not shooting for ad-hoc tiny fix but started a path where we fill each and every gap which will end-up in a solution where we hit the functionality and UX bar just like the Spark solution.
> > > From plan and code perspective I'm more ahead of this FLIP.
> > > So when you aim for different task scheduling then make your exact suggestion instead of providing hacks.
> > >
> > > If I assume correctly you suggest to create a FLIP where we define and agree all the missing pieces in a single giga big FLIP, right?
> > > I would say there are obvious missing pieces which are clear that they needed.
> > > Just like in PRs the more consumable pieces we have the better because this single change is about 1k lines of code. Having an overkill FLIP/PR can end up in feature creep which I think is disadvantageous.
> > >
> > > Of course this doesn't exclude the possibility that we start general more high level discussion about the whole state observability story.
> > > Here are my high level conceptual points (I consider roughly each point as a separate FLIP):
> > > * Store human readable IDs for operators in metadata
> > > * Expose the metadata as data stream
> > > * Store state with user defined schemas as self containing entity
> > > * SQL integration
> > > * State metastore with all the created checkpoints/savepoints
> > > * State file cleanup strategy in case of failure
> > > * Optional: Some extra tool like metadata explorer
> > >
> > > That said I suggest to split the higher level discussion from this FLIP in a separate thread.
> > >
> > > BR,
> > > G
> > >
> > >
> > > On Fri, Aug 9, 2024 at 10:17 AM Zakelly Lan <zakelly....@gmail.com> wrote:
> > >
> > > > Hi Márton and Gabor,
> > > >
> > > > Thanks for sharing context!
> > > >
> > > > Yes, I'd admit that users need a more friendly way to explore states. And it seems Flink lacks something like the state metadata store. I'd suggest we could think of this as a whole, to store enough information for querying, including operator names, uids, hashes, as well as the state types or descriptors. Moreover we provide a tool to list those metadata. My thought is to provide a complete solution instead of adding one or two specific data alongside the checkpoint. WDYT? I believe with the state schema queryable, the State Processor API could become more powerful and easier to use.
> > > >
> > > > But such a solution requires more design and discussion.
> > > > Regarding the current issue you are facing, here's my idea: If you could get access to the web UI, you can get the hash (vertex id) in the url by clicking and zooming in on the operator you want to query. IIUC, this hash can be used to query the state. Is this feasible? Additionally, I think we could add user-defined UIDs on the web UI and related REST APIs. Thus users could easily identify an operator by uid, or get the uid of an operator.
> > > >
> > > > Best,
> > > > Zakelly
> > > >
> > > > On Thu, Aug 8, 2024 at 11:03 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
> > > >
> > > > > Hi Zakelly,
> > > > >
> > > > > Thanks for the feedback, let me elaborate on this.
> > > > >
> > > > > In short Databricks has created a much more user friendly solution[1] for state observability (based on Flink's state processor API) than what we have now.
> > > > >
> > > > > Up until now our state processor API was good enough but now we're lagging behind. We see users (just like Spark) where the first class citizen is the state itself and they're pointing to the new Spark solution. Since the state became first class citizen there is a natural need to use it for business logic validation, debugging, explanatory browsing, etc...
> > > > >
> > > > > The main message here is that there are cases where users are not able to identify operators because hash is a one way conversion.
> > > > > I'm open to any suggestion but somehow the initial operator human readable identifier must be available. Let me come up with examples where users are completely blind.
> > > > >
> > > > > > Are you saying the user can set the operator uid but then doesn't know what they set when debugging?
> > > > >
> > > > > There are cases where the user is setting the UID in the job, such case it's not user friendly to parse git repos but doable.
> > > > > But there are cases where the user has limited or no control related UIDs:
> > > > > * SQL jobs are generating operators with meaningful names, but I think it's not realistic to enforce users to understand all the internals of Flink SQL implementation (which operator named where and how).
> > > > > * Iceberg is using the given UID as prefix and generating more operators with it
> > > > > * Weak justification but exists: Since operator name and UID are both optional some of the users are setting name only. Such case Flink generates a random hash, where only name can give some pointers.
> > > > >
> > > > > Hope I've given better context.
> > > > >
> > > > > [1] https://www.databricks.com/blog/announcing-state-reader-api-new-statestore-data-source
> > > > >
> > > > > BR,
> > > > > G
> > > > >
> > > > >
> > > > > On Thu, Aug 8, 2024 at 12:06 PM Zakelly Lan <zakelly....@gmail.com> wrote:
> > > > >
> > > > > > Hi Gabor,
> > > > > >
> > > > > > Thanks for the proposal! However, I find it a little strange. Are you saying the user can set the operator uid but then doesn't know what they set when debugging? Otherwise, is the `OperatorIdentifier.forUid("my-uid")` feasible? I understand your point about potential cross-team work, but the person may not be able to debug code that was not written by them. Things get complex in this scenario. Could you provide more details about the issue you are facing?
> > > > > >
> > > > > > Regarding the checkpoint, it is not designed to be self-contained or human-readable. I suggest not introducing such columns for debugging purposes.
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > > Zakelly
> > > > > >
> > > > > > On Wed, Aug 7, 2024 at 10:07 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Devs,
> > > > > > >
> > > > > > > I would like to start a discussion on FLIP-474: Store operator name and UID in state metadata[1].
> > > > > > >
> > > > > > > In short users are interested in what kind of operators are inside a checkpoint data which can be enhanced from user experience perspective. The details can be found in FLIP-474[1].
> > > > > > >
> > > > > > > Please share your thoughts on this.
> > > > > > >
> > > > > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-474%3A+Store+operator+name+and+UID+in+state+metadata
> > > > > > >
> > > > > > > BR,
> > > > > > > G