Hi Gabor,

Thanks for clarifying! It's important that we share the same vision.

Best,
Zakelly

On Mon, Aug 26, 2024 at 3:56 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:

> Hi All,
>
> Thanks for the contributions!
> The asked umbrella document helped to resolve all the misunderstandings and align to a common result.
>
> As a result I think we're ready for the voting thread.
>
> BR,
> G
>
> On Mon, Aug 19, 2024 at 8:47 AM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
>
> > Hi All,
> >
> > Based on our agreement I've created a draft Flink state observability umbrella [1].
> > Please share your comments. It contains some details to give some insights but the focus would be on the direction.
> >
> > [1] https://docs.google.com/document/d/1Du1-TShoOjaNDCahs3sgLWIpYkXzJPdSkgHcWLpELyw/edit
> >
> > BR,
> > G
> >
> > On Sat, Aug 10, 2024 at 10:54 AM Zakelly Lan <zakelly....@gmail.com> wrote:
> >
> >> Hi Gabor,
> >>
> >> I apologize for any confusion. Let me clarify my position.
> >>
> >> The concept of state observability is important for users, and the current FLIP seems to be a step in the right direction. However, before we proceed, I suggest we discuss the final presentation of the state observability to the user and consider the high-level vision for achieving this. It's essential to ensure that the current FLIP aligns with the overall objective. I'm not suggesting a comprehensive FLIP to address all the missing pieces, and one FLIP for each piece is fine for me. I just want to ensure that we are on the same page in terms of vision. The last thing I want is a fragmented approach resulting in refactoring or deprecation of code when we need a complete feature.
> >>
> >> Actually, I would hesitate about the current proposal of adding uid *in* the state metadata. It may cause state incompatibility issues across versions. In theory we can do this but it is better not if we are adding data not for fault tolerance but only for human readability.
> >> And it could be worse if we add one or two columns sporadically in future.
> >>
> >> In fact, I expect the state metadata store to exist next to the checkpoint metadata, rather than within it. This gives us enough flexibility to polish this function as users need it, and without breaking checkpoint compatibility too often. Or moreover we don't have to stick to the form of checkpoint and we could choose a more human readable format like json for the metadata store. This is where I think this FLIP is inconsistent with my expectation of the state observability approach. These considerations deserve a discussion before proceeding with other details. WDTY?
> >>
> >> Best,
> >> Zakelly
> >>
> >> On Fri, Aug 9, 2024 at 8:22 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
> >>
> >> > Hi David,
> >> >
> >> > Thanks for sharing your thoughts!
> >> >
> >> > > It sounds like you might already have an end-to-end solution in mind. It would be really helpful if you could put that into writing so we can all align our thinking.
> >> >
> >> > It makes sense to create a high level vision.
> >> >
> >> > > I’m not a fan of the mindset of “this is how it was done in Spark, so we’ll just replicate it” without proper discussion. We’ve had similar conversations before.
> >> >
> >> > I think we've had this conversation already in case of delegation token framework and I can say the same. No intention to take over things blindly but it's not a shame to be inspired by solutions which are welcome by users.
> >> > The intention is similar just like in scalable authentication area where Flink is now ahead of Spark.
> >> >
> >> > > Would it be too much to ask for a FLIP that outlines the overall vision (without delving too deeply into the details) to ensure everyone is aligned and moving in the same direction?
> >> >
> >> > That's a fair point and a constructive way how we can proceed.
> >> > I'm going to come back with the details...
> >> >
> >> > BR,
> >> > G
> >> >
> >> > On Fri, Aug 9, 2024 at 1:36 PM David Morávek <d...@apache.org> wrote:
> >> >
> >> > > Hi Gabor,
> >> > >
> >> > > Thanks for taking the initiative on this. It’s clear that significant improvements are needed in this area, and parsing state files can be incredibly challenging, even for those who are well-versed in it.
> >> > >
> >> > > > Just to make it crystal clear, I’m not shooting for an ad-hoc tiny fix but started a path where we fill each and every gap which will end up in a functionality and UX bar just like the Spark solution.
> >> > >
> >> > > It sounds like you might already have an end-to-end solution in mind. It would be really helpful if you could put that into writing so we can all align our thinking.
> >> > >
> >> > > I’m not a fan of the mindset of “this is how it was done in Spark, so we’ll just replicate it” without proper discussion. We’ve had similar conversations before.
> >> > >
> >> > > > But this doesn’t mean we create a single giga big FLIP after several months of discussion.
> >> > >
> >> > > I don’t think anyone is asking for a massive FLIP after lengthy discussions, but having a document that outlines the overall vision could be incredibly valuable, especially in a distributed setting. It also opens the door for others to contribute to and shape this shared vision, which is a core principle of community-driven open-source development.
> >> > >
> >> > > Would it be too much to ask for a FLIP that outlines the overall vision (without delving too deeply into the details) to ensure everyone is aligned and moving in the same direction?
> >> > >
> >> > > Best,
> >> > > D.
> >> > >
> >> > > On Fri, Aug 9, 2024 at 11:44 AM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
> >> > >
> >> > > > Hi Zakelly,
> >> > > >
> >> > > > > I'd suggest we could think of this as a whole
> >> > > >
> >> > > > In general I think we have the same idea in our mind about considering the state observability as a whole, just we need to agree about the physical task scheduling.
> >> > > >
> >> > > > > But such a solution requires more design and discussion
> >> > > >
> >> > > > I can't even agree more. But this doesn't mean we create a single giga big FLIP after several months of discussion.
> >> > > >
> >> > > > > Regarding the current issue you are facing, here's my idea
> >> > > >
> >> > > > Just to make it crystal clear, I'm not shooting for ad-hoc tiny fix but started a path where we fill each and every gap which will end-up in a solution where we hit the functionality and UX bar just like the Spark solution.
> >> > > > From plan and code perspective I'm more ahead of this FLIP.
> >> > > > So when you aim for different task scheduling then make your exact suggestion instead of providing hacks.
> >> > > >
> >> > > > If I assume correctly you suggest to create a FLIP where we define and agree all the missing pieces in a single giga big FLIP, right?
> >> > > > I would say there are obvious missing pieces which are clear that they needed. Just like in PRs the more consumable pieces we have the better because this single change is about 1k lines of code. Having an overkill FLIP/PR can end up in feature creep which I think is disadvantageous.
> >> > > >
> >> > > > Of course this doesn't exclude the possibility that we start general more high level discussion about the whole state observability story.
> >> > > > Here are my high level conceptual points (I consider roughly each point as a separate FLIP):
> >> > > > * Store human readable IDs for operators in metadata
> >> > > > * Expose the metadata as data stream
> >> > > > * Store state with user defined schemas as self containing entity
> >> > > > * SQL integration
> >> > > > * State metastore with all the created checkpoints/savepoints
> >> > > > * State file cleanup strategy in case of failure
> >> > > > * Optional: Some extra tool like metadata explorer
> >> > > >
> >> > > > That said I suggest to split the higher level discussion from this FLIP in a separate thread.
> >> > > >
> >> > > > BR,
> >> > > > G
> >> > > >
> >> > > > On Fri, Aug 9, 2024 at 10:17 AM Zakelly Lan <zakelly....@gmail.com> wrote:
> >> > > >
> >> > > > > Hi Márton and Gabor,
> >> > > > >
> >> > > > > Thanks for sharing context!
> >> > > > >
> >> > > > > Yes, I'd admit that users need a more friendly way to explore states. And it seems Flink lacks something like the state metadata store. I'd suggest we could think of this as a whole, to store enough information for querying, including operator names, uids, hashes, as well as the state types or descriptors. Moreover we provide a tool to list those metadata. My thoughts is to provide a complete solution instead of adding one or two specific data alongside the checkpoint. WDTY? I believe with the state schema queryable, the State Processor API could become more powerful and easier to use.
> >> > > > >
> >> > > > > But such a solution requires more design and discussion. Regarding the current issue you are facing, here's my idea: If you could get access to the web UI, you can get the hash (vertex id) in the url by clicking and zooming in on the operator you want to query. IIUC, this hash can be used to query the state. Is this feasible? Additionally, I think we could add user-defined UIDs on the web UI and related REST APIs. Thus users could easily identify an operator by uid, or get the uid of an operator.
> >> > > > >
> >> > > > > Best,
> >> > > > > Zakelly
> >> > > > >
> >> > > > > On Thu, Aug 8, 2024 at 11:03 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
> >> > > > >
> >> > > > > > Hi Zakelly,
> >> > > > > >
> >> > > > > > Thanks for the feedback, let me elaborate on this.
> >> > > > > >
> >> > > > > > In short Databricks has created a much more user friendly solution[1] for state observability (based on Flink's state processor API) than what we have now.
> >> > > > > >
> >> > > > > > Up until now our state processor API was good enough but now we're lagging behind. We see users (just like Spark) where the first class citizen is the state itself and they're pointing to the new Spark solution. Since the state became first class citizen there is a natural need to use it for business logic validation, debugging, explanatory browsing, etc...
> >> > > > > >
> >> > > > > > The main message here is that there are cases where users are not able to identify operators because hash is a one way conversion.
> >> > > > > > I'm open to any suggestion but somehow the initial operator human readable identifier must be available. Let me come up with examples where users are completely blind.
> >> > > > > >
> >> > > > > > > Are you saying the user can set the operator uid but then doesn't know what they set when debugging?
> >> > > > > >
> >> > > > > > There are cases where the user is setting the UID in the job, such case it's not user friendly to parse git repos but doable.
> >> > > > > > But there are cases where the user has limited or no control related UIDs:
> >> > > > > > * SQL jobs are generating operators with meaningful names, but I think it's not realistic to enforce users to understand all the internals of Flink SQL implementation (which operator named where and how).
> >> > > > > > * Iceberg is using the given UID as prefix and generating more operators with it
> >> > > > > > * Weak justification but exists: Since operator name and UID are both optional some of the users are setting name only. Such case Flink generates a random hash, where only name can give some pointers.
> >> > > > > >
> >> > > > > > Hope I've given better context.
> >> > > > > >
> >> > > > > > [1] https://www.databricks.com/blog/announcing-state-reader-api-new-statestore-data-source
> >> > > > > >
> >> > > > > > BR,
> >> > > > > > G
> >> > > > > >
> >> > > > > > On Thu, Aug 8, 2024 at 12:06 PM Zakelly Lan <zakelly....@gmail.com> wrote:
> >> > > > > >
> >> > > > > > > Hi Gabor,
> >> > > > > > >
> >> > > > > > > Thanks for the proposal!
> >> > > > > > > However, I find it a little strange. Are you saying the user can set the operator uid but then doesn't know what they set when debugging? Otherwise, is the `OperatorIdentifier.forUid("my-uid")` feasible? I understand your point about potential cross-team work, but the person may not be able to debug code that was not written by them. Things get complex in this scenario. Could you provide more details about the issue you are facing?
> >> > > > > > >
> >> > > > > > > Regarding the checkpoint, it is not designed to be self-contained or human-readable. I suggest not introducing such columns for debugging purposes.
> >> > > > > > >
> >> > > > > > > Best,
> >> > > > > > > Zakelly
> >> > > > > > >
> >> > > > > > > On Wed, Aug 7, 2024 at 10:07 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
> >> > > > > > >
> >> > > > > > > > Hi Devs,
> >> > > > > > > >
> >> > > > > > > > I would like to start a discussion on FLIP-474: Store operator name and UID in state metadata[1].
> >> > > > > > > >
> >> > > > > > > > In short users are interested in what kind of operators are inside a checkpoint data which can be enhanced from user experience perspective. The details can be found in FLIP-474[1].
> >> > > > > > > >
> >> > > > > > > > Please share your thoughts on this.
> >> > > > > > > >
> >> > > > > > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-474%3A+Store+operator+name+and+UID+in+state+metadata
> >> > > > > > > >
> >> > > > > > > > BR,
> >> > > > > > > > G