Hi Timo, I'm fine with your latest suggestion that introducing a flag to control expanding behavior of metadata virtual columns, but not introducing any concept of system/pseudo columns for now.
Best, Jark On Tue, 15 Aug 2023 at 23:25, Timo Walther <twal...@apache.org> wrote: > Hi everyone, > > I would like to bump this thread up again. > > Esp. I would like to hear opinions on my latest suggestions to simply > use METADATA VIRTUAL as system columns and only introduce a config > option for the SELECT * behavior. Implementation-wise this means minimal > effort and less new concepts. > > Looking forward to any kind of feedback. > > Thanks, > Timo > > On 07.08.23 12:07, Timo Walther wrote: > > Hi everyone, > > > > thanks for the various feedback and lively discussion. Sorry, for the > > late reply as I was on vacation. Let me answer to some of the topics: > > > > 1) Systems columns should not be shown with DESCRIBE statements > > > > This sounds fine to me. I will update the FLIP in the next iteration. > > > > 2) Do you know why most SQL systems do not need any prefix with their > > pseudo column? > > > > Because most systems do not have external catalogs or connectors. And > > also the number of system columns is limited to a handful of columns. > > Flink is more generic and thus more complex. And we have already the > > concepts of metadata columns. We need to be careful with not overloading > > our language. > > > > 3) Implementation details > > > > > how to you plan to implement the "system columns", do we need to add > > it to `RelNode` level? Or we just need to do it in the > > parsing/validating phase? > > > I'm not sure that Calcite's "system column" feature is fully ready > > > > My plan would be to only modify the parsing/validating phase. I would > > like to avoid additional complexity in planner rules and > > connector/catalog interfaces. Metadata columns already support > > projection push down and are passed through the stack (via Schema, > > ResolvedSchema, SupportsReadableMetadata). Calcite's "system column" > > feature is not fully ready yet and it would be a large effort > > potentially introducing bugs in supporting it. Thus, I'm proposing to > > leverage what we already have. The only part that needs to be modified > > is the "expand star" method in SqlValidator and Table API. > > > > Queries such as `SELECT * FROM (SELECT $rowtime, * FROM t);` would show > > $rowtime as the expand star has only a special case when in the scope > > for `FROM t`. All further subqueries treat it as a regular column. > > > > 4) Built-in defined pseudo-column "$rowtime" > > > > > Did you consider making it as a built-in defined pseudo-column > > "$rowtime" which returns the time attribute value (if exists) or null > > (if non-exists) for every table/query, and pseudo-column "$proctime" > > always returns PROCTIME() value for each table/query > > > > Built-in pseudo-columns mean that connector or catalog providers need > > consensus in Flink which pseudo-columns should be built-in. We should > > keep the concept generic and let platform providers decide which pseudo > > columns to expose. $rowtime might be obvious but others such as > > $partition or $offset are tricky to get consensus as every external > > connector works differently. Also a connector might want to expose > > different time semantics (such as ingestion time). > > > > 5) Any operator can introduce system (psedo) columns. > > > > This is clearly out of scope for this FLIP. The implementation effort > > would be huge and could introduce a lot of bugs. > > > > 6) "Metadata Key Prefix Constraint" which is still a little complex > > > > Another option could be to drop the naming pattern constraint. We could > > make it configurable that METADATA VIRTUAL columns are never selected by > > default in SELECT * or visible in DESCRIBE. This would further simplify > > the FLIP and esp lower the impact on the planner and all interfaces. > > > > What do you think about this? We could introduce a flag: > > > > table.expand-metadata-columns (better name to be defined) > > > > This way we don't need to introduce the concept of system columns yet, > > but can still offer similar functionality with minimal overhead in the > > code base. > > > > Regards, > > Timo > > > > > > > > > > On 04.08.23 23:06, Alexey Leonov-Vendrovskiy wrote: > >> Looks like both kinds of system columns can converge. > >> We can say that any operator can introduce system (psedo) columns. > >> > >> cc Eugene who is also interested in the subject. > >> > >> On Wed, Aug 2, 2023 at 1:03 AM Paul Lam <paullin3...@gmail.com> wrote: > >> > >>> Hi Timo, > >>> > >>> Thanks for starting the discussion! System columns are no doubt a > >>> good boost on Flink SQL’s usability, and I see the feedbacks are > >>> mainly concerns about the accessibility of system columns. > >>> > >>> I think most of the concerns could be solved by clarifying the > >>> ownership of the system columns. Different from databases like > >>> Oracle/BigQuery/PG who owns the data/metadata, Flink uses the > >>> data/metadata from external systems. That means Flink could > >>> have 2 kinds of system columns (take ROWID for example): > >>> > >>> 1. system columns provided by external systems via catalogs, such > >>> as ROWID from the original system. > >>> 2. system columns generated by Flink, such as ROWID generated by > >>> Flink itself. > >>> > >>> IIUC, the FLIP is proposing the 1st approach: the catalog defines what > >>> system columns to provide, and Flink treats them as normal columns > >>> with a special naming pattern. > >>> > >>> On the other hand, Jark is proposing the 2nd one: the system columns > >>> are defined and owned by Flink, and can be inferred from external > >>> systems. Therefore, system columns should be predefined by Flink, > >>> and optionally implemented by the catalogs. > >>> > >>> Personally, I’m in favor of the 2nd approach, because it makes the > >>> system columns very accessible and more aligned across the catalogs. > >>> > >>> BTW, I second Alexey that systems columns should not be shown with > >>> DESCRIBE statements. > >>> > >>> WDYT? Thanks! > >>> > >>> Best, > >>> Paul Lam > >>> > >>>> 2023年7月31日 23:54,Jark Wu <imj...@gmail.com> 写道: > >>>> > >>>> Hi Timo, > >>>> > >>>> Thanks for your proposal. I think this is a nice feature for users > >>>> and I > >>>> prefer option 3. > >>>> > >>>> I only have one concern about the concept of pseudo-column or > >>>> system-column, > >>>> because this is the first time we introduce it in Flink SQL. The > >>>> confusion is similar to the > >>>> question of Benchao and Sergey about the propagation of pseudo-column. > >>>> > >>>> From my understanding, a pseudo-column can be get from an arbitrary > >>> query, > >>>> just similar to > >>>> ROWNUM in Oracle[1], such as : > >>>> > >>>> SELECT * > >>>> FROM (SELECT * FROM employees ORDER BY employee_id) > >>>> WHERE ROWNUM < 11; > >>>> > >>>> However, IIUC, the proposed "$rowtime" pseudo-column can only be got > >>>> from > >>>> the physical table > >>>> and can't be got from queries even if the query propagates the rowtime > >>>> attribute. There was also > >>>> a discussion about adding a pseudo-column "_proctime" [2] to make > >>>> lookup > >>>> join easier to use > >>>> which can be got from arbitrary queries. That "_proctime" may conflict > >>> with > >>>> the proposed > >>>> pseudo-column concept. > >>>> > >>>> Did you consider making it as a built-in defined pseudo-column > >>>> "$rowtime" > >>>> which returns the > >>>> time attribute value (if exists) or null (if non-exists) for every > >>>> table/query, and pseudo-column > >>>> "$proctime" always returns PROCTIME() value for each table/query. In > >>>> this > >>>> way, catalogs only need > >>>> to provide a default rowtime attribute and users can get it in the > same > >>>> way. And we don't need > >>>> to introduce the contract interface of "Metadata Key Prefix > Constraint" > >>>> which is still a little complex > >>>> for users and devs to understand. > >>>> > >>>> Best, > >>>> Jark > >>>> > >>>> [1]: > >>>> > >>> > https://docs.oracle.com/cd/E11882_01/server.112/e41084/pseudocolumns009.htm#SQLRF00255 > >>>> [2]: https://lists.apache.org/thread/7ln106qxyw8sp7ljq40hs2p1lb1gdwj5 > >>>> > >>>> > >>>> > >>>> > >>>> On Fri, 28 Jul 2023 at 06:18, Alexey Leonov-Vendrovskiy < > >>>> vendrov...@gmail.com> wrote: > >>>> > >>>>>> > >>>>>> `SELECT * FROM (SELECT $rowtime, * FROM t);` > >>>>>> Am I right that it will show `$rowtime` in output ? > >>>>> > >>>>> > >>>>> Yes, all explicitly selected columns become a part of the result (and > >>>>> intermediate) schema, and hence propagate. > >>>>> > >>>>> On Thu, Jul 27, 2023 at 2:40 PM Alexey Leonov-Vendrovskiy < > >>>>> vendrov...@gmail.com> wrote: > >>>>> > >>>>>> Thank you, Timo, for starting this FLIP! > >>>>>> > >>>>>> I propose the following change: > >>>>>> > >>>>>> Remove the requirement that DESCRIBE need to show system columns. > >>>>>> > >>>>>> > >>>>>> Some concrete vendor specific catalog implementations might prefer > >>>>>> this > >>>>>> approach. > >>>>>> Usually the same system columns are available on all (or family) of > >>>>>> tables, and it can be easily captured in the documentation. > >>>>>> > >>>>>> For example, BigQuery does exactly this: there, pseudo-columns do > not > >>>>> show > >>>>>> up in the table schema in any place, but can be accessed via > >>>>>> reference. > >>>>>> > >>>>>> So I propose we: > >>>>>> a) Either we say that DESCRIBE doesn't show system columns, > >>>>>> b) Or leave this vendor-specific / or configurable via flag (if > >>> needed). > >>>>>> > >>>>>> Regards, > >>>>>> Alexey > >>>>>> > >>>>>> On Thu, Jul 27, 2023 at 3:27 AM Sergey Nuyanzin < > snuyan...@gmail.com> > >>>>>> wrote: > >>>>>> > >>>>>>> Hi Timo, > >>>>>>> > >>>>>>> Thanks for the FLIP. > >>>>>>> I also tend to think that Option 3 is better. > >>>>>>> > >>>>>>> I would be also interested in a question mentioned by Benchao Li. > >>>>>>> And a similar question about nested queries like > >>>>>>> `SELECT * FROM (SELECT $rowtime, * FROM t);` > >>>>>>> Am I right that it will show `$rowtime` in output ? > >>>>>>> > >>>>>>> > >>>>>>> On Thu, Jul 27, 2023 at 6:58 AM Benchao Li <libenc...@apache.org> > >>>>> wrote: > >>>>>>> > >>>>>>>> Hi Timo, > >>>>>>>> > >>>>>>>> Thanks for the FLIP, I also like the idea and option 3 sounds > >>>>>>>> good to > >>>>>>> me. > >>>>>>>> > >>>>>>>> I would like to discuss a case which is not mentioned in the > >>>>>>>> current > >>>>>>> FLIP. > >>>>>>>> How are the "System column"s expressed in intermediate result, > e.g. > >>>>>>> Join? > >>>>>>>> E.g. `SELECT * FROM t1 JOIN t2`, I guess it should not include > >>> "system > >>>>>>>> columns" from t1 and t2 as you proposed, and for `SELECT > >>>>>>>> t1.$rowtime, > >>>>> * > >>>>>>>> FROM t1 JOIN t2`, it should also be valid. > >>>>>>>> Then the question is how to you plan to implement the "system > >>>>> columns", > >>>>>>> do > >>>>>>>> we need to add it to `RelNode` level? Or we just need to do it > >>>>>>>> in the > >>>>>>>> parsing/validating phase? > >>>>>>>> I'm not sure that Calcite's "system column" feature is fully ready > >>> for > >>>>>>> this > >>>>>>>> since the code about this part is imported from the earlier > project > >>>>>>> before > >>>>>>>> it gets into Apache, and has not been considered much in the past > >>>>>>>> development. > >>>>>>>> > >>>>>>>> > >>>>>>>> Jing Ge <j...@ververica.com.invalid> 于2023年7月26日周三 00:01写 > >>>>>>>> 道: > >>>>>>>> > >>>>>>>>> Hi Timo, > >>>>>>>>> > >>>>>>>>> Thanks for your proposal. It is a very pragmatic feature. Among > >>>>>>>>> all > >>>>>>>> options > >>>>>>>>> in the FLIP, option 3 is one I prefer too and I'd like to ask > some > >>>>>>>>> questions to understand your thoughts. > >>>>>>>>> > >>>>>>>>> 1. I did some research on pseudo columns, just out of > >>>>>>>>> curiosity, do > >>>>>>> you > >>>>>>>>> know why most SQL systems do not need any prefix with their > pseudo > >>>>>>>> column? > >>>>>>>>> 2. Some platform providers will use ${variable_name} to define > >>>>>>>>> their > >>>>>>> own > >>>>>>>>> configurations and allow them to be embedded into SQL scripts. > >>>>>>>>> Will > >>>>>>> there > >>>>>>>>> be any conflict with option 3? > >>>>>>>>> > >>>>>>>>> Best regards, > >>>>>>>>> Jing > >>>>>>>>> > >>>>>>>>> On Tue, Jul 25, 2023 at 7:00 PM Konstantin Knauf > >>>>>>>>> <kna...@apache.org > >>>>>> > >>>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> Hi Timo, > >>>>>>>>>> > >>>>>>>>>> this makes sense to me. Option 3 seems reasonable, too. > >>>>>>>>>> > >>>>>>>>>> Cheers, > >>>>>>>>>> > >>>>>>>>>> Konstantin > >>>>>>>>>> > >>>>>>>>>> Am Di., 25. Juli 2023 um 12:53 Uhr schrieb Timo Walther < > >>>>>>>>>> twal...@apache.org > >>>>>>>>>>> : > >>>>>>>>>> > >>>>>>>>>>> Hi everyone, > >>>>>>>>>>> > >>>>>>>>>>> I would like to start a discussion about introducing the > concept > >>>>>>> of > >>>>>>>>>>> "System Columns" in SQL and Table API. > >>>>>>>>>>> > >>>>>>>>>>> The subject sounds bigger than it actually is. Luckily, Flink > >>>>> SQL > >>>>>>>>>>> already exposes the concept of metadata columns. And this > >>>>>>> proposal is > >>>>>>>>>>> just a slight adjustment for how metadata columns can be used > as > >>>>>>>> system > >>>>>>>>>>> columns. > >>>>>>>>>>> > >>>>>>>>>>> The biggest problem of metadata columns currently is that a > >>>>>>> catalog > >>>>>>>>>>> implementation can't provide them by default because they would > >>>>>>>> affect > >>>>>>>>>>> `SELECT *` when adding another one. > >>>>>>>>>>> > >>>>>>>>>>> Looking forward to your feedback on FLIP-348: > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>> > >>> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-348%3A+Support+System+Columns+in+SQL+and+Table+API > >>>>>>>>>>> > >>>>>>>>>>> Thanks, > >>>>>>>>>>> Timo > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> -- > >>>>>>>>>> https://twitter.com/snntrable > >>>>>>>>>> https://github.com/knaufk > >>>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> -- > >>>>>>>> > >>>>>>>> Best, > >>>>>>>> Benchao Li > >>>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> -- > >>>>>>> Best regards, > >>>>>>> Sergey > >>>>>>> > >>>>>> > >>>>> > >>> > >>> > >> > > > >