Re: [DISCUSS] FLIP-348: Support System Columns in SQL and Table API

Jark Wu Thu, 17 Aug 2023 04:56:09 -0700

Hi Timo,

I'm fine with your latest suggestion that introducing a flag to control
expanding behavior of metadata virtual columns, but not introducing
any concept of system/pseudo columns for now.


Best,
Jark

On Tue, 15 Aug 2023 at 23:25, Timo Walther <[email protected]> wrote:

> Hi everyone,
>
> I would like to bump this thread up again.
>
> Esp. I would like to hear opinions on my latest suggestions to simply
> use METADATA VIRTUAL as system columns and only introduce a config
> option for the SELECT * behavior. Implementation-wise this means minimal
> effort and less new concepts.
>
> Looking forward to any kind of feedback.
>
> Thanks,
> Timo
>
> On 07.08.23 12:07, Timo Walther wrote:
> > Hi everyone,
> >
> > thanks for the various feedback and lively discussion. Sorry, for the
> > late reply as I was on vacation. Let me answer to some of the topics:
> >
> > 1) Systems columns should not be shown with DESCRIBE statements
> >
> > This sounds fine to me. I will update the FLIP in the next iteration.
> >
> > 2) Do you know why most SQL systems do not need any prefix with their
> > pseudo column?
> >
> > Because most systems do not have external catalogs or connectors. And
> > also the number of system columns is limited to a handful of columns.
> > Flink is more generic and thus more complex. And we have already the
> > concepts of metadata columns. We need to be careful with not overloading
> > our language.
> >
> > 3) Implementation details
> >
> >  > how to you plan to implement the "system columns", do we need to add
> > it to `RelNode` level? Or we just need to do it in the
> > parsing/validating phase?
> >  > I'm not sure that Calcite's "system column" feature is fully ready
> >
> > My plan would be to only modify the parsing/validating phase. I would
> > like to avoid additional complexity in planner rules and
> > connector/catalog interfaces. Metadata columns already support
> > projection push down and are passed through the stack (via Schema,
> > ResolvedSchema, SupportsReadableMetadata). Calcite's "system column"
> > feature is not fully ready yet and it would be a large effort
> > potentially introducing bugs in supporting it. Thus, I'm proposing to
> > leverage what we already have. The only part that needs to be modified
> > is the "expand star" method in SqlValidator and Table API.
> >
> > Queries such as `SELECT * FROM (SELECT $rowtime, * FROM t);` would show
> > $rowtime as the expand star has only a special case when in the scope
> > for `FROM t`. All further subqueries treat it as a regular column.
> >
> > 4) Built-in defined pseudo-column "$rowtime"
> >
> >  > Did you consider making it as a built-in defined pseudo-column
> > "$rowtime" which returns the time attribute value (if exists) or null
> > (if non-exists) for every table/query, and pseudo-column "$proctime"
> > always returns PROCTIME() value for each table/query
> >
> > Built-in pseudo-columns mean that connector or catalog providers need
> > consensus in Flink which pseudo-columns should be built-in. We should
> > keep the concept generic and let platform providers decide which pseudo
> > columns to expose. $rowtime might be obvious but others such as
> > $partition or $offset are tricky to get consensus as every external
> > connector works differently. Also a connector might want to expose
> > different time semantics (such as ingestion time).
> >
> > 5) Any operator can introduce system (psedo) columns.
> >
> > This is clearly out of scope for this FLIP. The implementation effort
> > would be huge and could introduce a lot of bugs.
> >
> > 6) "Metadata Key Prefix Constraint" which is still a little complex
> >
> > Another option could be to drop the naming pattern constraint. We could
> > make it configurable that METADATA VIRTUAL columns are never selected by
> > default in SELECT * or visible in DESCRIBE. This would further simplify
> > the FLIP and esp lower the impact on the planner and all interfaces.
> >
> > What do you think about this? We could introduce a flag:
> >
> > table.expand-metadata-columns (better name to be defined)
> >
> > This way we don't need to introduce the concept of system columns yet,
> > but can still offer similar functionality with minimal overhead in the
> > code base.
> >
> > Regards,
> > Timo
> >
> >
> >
> >
> > On 04.08.23 23:06, Alexey Leonov-Vendrovskiy wrote:
> >> Looks like both kinds of system columns can converge.
> >> We can say that any operator can introduce system (psedo) columns.
> >>
> >> cc Eugene who is also interested in the subject.
> >>
> >> On Wed, Aug 2, 2023 at 1:03 AM Paul Lam <[email protected]> wrote:
> >>
> >>> Hi Timo,
> >>>
> >>> Thanks for starting the discussion! System columns are no doubt a
> >>> good boost on Flink SQL’s usability, and I see the feedbacks are
> >>> mainly concerns about the accessibility of system columns.
> >>>
> >>> I think most of the concerns could be solved by clarifying the
> >>> ownership of the system columns. Different from databases like
> >>> Oracle/BigQuery/PG who owns the data/metadata, Flink uses the
> >>> data/metadata from external systems. That means Flink could
> >>> have 2 kinds of system columns (take ROWID for example):
> >>>
> >>> 1. system columns provided by external systems via catalogs, such
> >>>      as ROWID from the original system.
> >>> 2. system columns generated by Flink, such as ROWID generated by
> >>>      Flink itself.
> >>>
> >>> IIUC, the FLIP is proposing the 1st approach: the catalog defines what
> >>> system columns to provide, and Flink treats them as normal columns
> >>> with a special naming pattern.
> >>>
> >>> On the other hand, Jark is proposing the 2nd one: the system columns
> >>> are defined and owned by Flink, and can be inferred from external
> >>> systems. Therefore, system columns should be predefined by Flink,
> >>> and optionally implemented by the catalogs.
> >>>
> >>> Personally, I’m in favor of the 2nd approach, because it makes the
> >>> system columns very accessible and more aligned across the catalogs.
> >>>
> >>> BTW, I second Alexey that systems columns should not be shown with
> >>> DESCRIBE statements.
> >>>
> >>> WDYT? Thanks!
> >>>
> >>> Best,
> >>> Paul Lam
> >>>
> >>>> 2023年7月31日 23:54，Jark Wu <[email protected]> 写道：
> >>>>
> >>>> Hi Timo,
> >>>>
> >>>> Thanks for your proposal. I think this is a nice feature for users
> >>>> and I
> >>>> prefer option 3.
> >>>>
> >>>> I only have one concern about the concept of pseudo-column or
> >>>> system-column,
> >>>> because this is the first time we introduce it in Flink SQL. The
> >>>> confusion is similar to the
> >>>> question of Benchao and Sergey about the propagation of pseudo-column.
> >>>>
> >>>>  From my understanding, a pseudo-column can be get from an arbitrary
> >>> query,
> >>>> just similar to
> >>>> ROWNUM in Oracle[1], such as :
> >>>>
> >>>> SELECT *
> >>>> FROM (SELECT * FROM employees ORDER BY employee_id)
> >>>> WHERE ROWNUM < 11;
> >>>>
> >>>> However, IIUC, the proposed "$rowtime" pseudo-column can only be got
> >>>> from
> >>>> the physical table
> >>>> and can't be got from queries even if the query propagates the rowtime
> >>>> attribute. There was also
> >>>> a discussion about adding a pseudo-column "_proctime" [2] to make
> >>>> lookup
> >>>> join easier to use
> >>>> which can be got from arbitrary queries. That "_proctime" may conflict
> >>> with
> >>>> the proposed
> >>>> pseudo-column concept.
> >>>>
> >>>> Did you consider making it as a built-in defined pseudo-column
> >>>> "$rowtime"
> >>>> which returns the
> >>>> time attribute value (if exists) or null (if non-exists) for every
> >>>> table/query, and pseudo-column
> >>>> "$proctime" always returns PROCTIME() value for each table/query. In
> >>>> this
> >>>> way, catalogs only need
> >>>> to provide a default rowtime attribute and users can get it in the
> same
> >>>> way. And we don't need
> >>>> to introduce the contract interface of "Metadata Key Prefix
> Constraint"
> >>>> which is still a little complex
> >>>> for users and devs to understand.
> >>>>
> >>>> Best,
> >>>> Jark
> >>>>
> >>>> [1]:
> >>>>
> >>>
> https://docs.oracle.com/cd/E11882_01/server.112/e41084/pseudocolumns009.htm#SQLRF00255
> >>>> [2]: https://lists.apache.org/thread/7ln106qxyw8sp7ljq40hs2p1lb1gdwj5
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Fri, 28 Jul 2023 at 06:18, Alexey Leonov-Vendrovskiy <
> >>>> [email protected]> wrote:
> >>>>
> >>>>>>
> >>>>>> `SELECT * FROM (SELECT $rowtime, * FROM t);`
> >>>>>> Am I right that it will show `$rowtime` in output ?
> >>>>>
> >>>>>
> >>>>> Yes, all explicitly selected columns become a part of the result (and
> >>>>> intermediate) schema, and hence propagate.
> >>>>>
> >>>>> On Thu, Jul 27, 2023 at 2:40 PM Alexey Leonov-Vendrovskiy <
> >>>>> [email protected]> wrote:
> >>>>>
> >>>>>> Thank you, Timo, for starting this FLIP!
> >>>>>>
> >>>>>> I propose the following change:
> >>>>>>
> >>>>>> Remove the requirement that DESCRIBE need to show system columns.
> >>>>>>
> >>>>>>
> >>>>>> Some concrete vendor specific catalog implementations might prefer
> >>>>>> this
> >>>>>> approach.
> >>>>>> Usually the same system columns are available on all (or family) of
> >>>>>> tables, and it can be easily captured in the documentation.
> >>>>>>
> >>>>>> For example, BigQuery does exactly this: there, pseudo-columns do
> not
> >>>>> show
> >>>>>> up in the table schema in any place, but can be accessed via
> >>>>>> reference.
> >>>>>>
> >>>>>> So I propose we:
> >>>>>> a) Either we say that DESCRIBE doesn't show system columns,
> >>>>>> b) Or leave this vendor-specific / or configurable via flag (if
> >>> needed).
> >>>>>>
> >>>>>> Regards,
> >>>>>> Alexey
> >>>>>>
> >>>>>> On Thu, Jul 27, 2023 at 3:27 AM Sergey Nuyanzin <
> [email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi Timo,
> >>>>>>>
> >>>>>>> Thanks for the FLIP.
> >>>>>>> I also tend to think that Option 3 is better.
> >>>>>>>
> >>>>>>> I would be also interested in a question mentioned by Benchao Li.
> >>>>>>> And a similar question about nested queries like
> >>>>>>> `SELECT * FROM (SELECT $rowtime, * FROM t);`
> >>>>>>> Am I right that it will show `$rowtime` in output ?
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Jul 27, 2023 at 6:58 AM Benchao Li <[email protected]>
> >>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi Timo,
> >>>>>>>>
> >>>>>>>> Thanks for the FLIP, I also like the idea and option 3 sounds
> >>>>>>>> good to
> >>>>>>> me.
> >>>>>>>>
> >>>>>>>> I would like to discuss a case which is not mentioned in the
> >>>>>>>> current
> >>>>>>> FLIP.
> >>>>>>>> How are the "System column"s expressed in intermediate result,
> e.g.
> >>>>>>> Join?
> >>>>>>>> E.g. `SELECT * FROM t1 JOIN t2`, I guess it should not include
> >>> "system
> >>>>>>>> columns" from t1 and t2 as you proposed, and for `SELECT
> >>>>>>>> t1.$rowtime,
> >>>>> *
> >>>>>>>> FROM t1 JOIN t2`, it should also be valid.
> >>>>>>>> Then the question is how to you plan to implement the "system
> >>>>> columns",
> >>>>>>> do
> >>>>>>>> we need to add it to `RelNode` level? Or we just need to do it
> >>>>>>>> in the
> >>>>>>>> parsing/validating phase?
> >>>>>>>> I'm not sure that Calcite's "system column" feature is fully ready
> >>> for
> >>>>>>> this
> >>>>>>>> since the code about this part is imported from the earlier
> project
> >>>>>>> before
> >>>>>>>> it gets into Apache, and has not been considered much in the past
> >>>>>>>> development.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Jing Ge <[email protected]> 于2023年7月26日周三 00:01写
> >>>>>>>> 道：
> >>>>>>>>
> >>>>>>>>> Hi Timo,
> >>>>>>>>>
> >>>>>>>>> Thanks for your proposal. It is a very pragmatic feature. Among
> >>>>>>>>> all
> >>>>>>>> options
> >>>>>>>>> in the FLIP, option 3 is one I prefer too and I'd like to ask
> some
> >>>>>>>>> questions to understand your thoughts.
> >>>>>>>>>
> >>>>>>>>> 1. I did some research on pseudo columns, just out of
> >>>>>>>>> curiosity, do
> >>>>>>> you
> >>>>>>>>> know why most SQL systems do not need any prefix with their
> pseudo
> >>>>>>>> column?
> >>>>>>>>> 2. Some platform providers will use ${variable_name} to define
> >>>>>>>>> their
> >>>>>>> own
> >>>>>>>>> configurations and allow them to be embedded into SQL scripts.
> >>>>>>>>> Will
> >>>>>>> there
> >>>>>>>>> be any conflict with option 3?
> >>>>>>>>>
> >>>>>>>>> Best regards,
> >>>>>>>>> Jing
> >>>>>>>>>
> >>>>>>>>> On Tue, Jul 25, 2023 at 7:00 PM Konstantin Knauf
> >>>>>>>>> <[email protected]
> >>>>>>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Timo,
> >>>>>>>>>>
> >>>>>>>>>> this makes sense to me. Option 3 seems reasonable, too.
> >>>>>>>>>>
> >>>>>>>>>> Cheers,
> >>>>>>>>>>
> >>>>>>>>>> Konstantin
> >>>>>>>>>>
> >>>>>>>>>> Am Di., 25. Juli 2023 um 12:53 Uhr schrieb Timo Walther <
> >>>>>>>>>> [email protected]
> >>>>>>>>>>> :
> >>>>>>>>>>
> >>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>
> >>>>>>>>>>> I would like to start a discussion about introducing the
> concept
> >>>>>>> of
> >>>>>>>>>>> "System Columns" in SQL and Table API.
> >>>>>>>>>>>
> >>>>>>>>>>> The subject sounds bigger than it actually is. Luckily, Flink
> >>>>> SQL
> >>>>>>>>>>> already exposes the concept of metadata columns. And this
> >>>>>>> proposal is
> >>>>>>>>>>> just a slight adjustment for how metadata columns can be used
> as
> >>>>>>>> system
> >>>>>>>>>>> columns.
> >>>>>>>>>>>
> >>>>>>>>>>> The biggest problem of metadata columns currently is that a
> >>>>>>> catalog
> >>>>>>>>>>> implementation can't provide them by default because they would
> >>>>>>>> affect
> >>>>>>>>>>> `SELECT *` when adding another one.
> >>>>>>>>>>>
> >>>>>>>>>>> Looking forward to your feedback on FLIP-348:
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>
> >>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-348%3A+Support+System+Columns+in+SQL+and+Table+API
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Timo
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> https://twitter.com/snntrable
> >>>>>>>>>> https://github.com/knaufk
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Benchao Li
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Best regards,
> >>>>>>> Sergey
> >>>>>>>
> >>>>>>
> >>>>>
> >>>
> >>>
> >>
> >
>
>

Re: [DISCUSS] FLIP-348: Support System Columns in SQL and Table API

Reply via email to