Re: [DISCUSS] FLIP-348: Support System Columns in SQL and Table API

Timo Walther Tue, 29 Aug 2023 02:19:58 -0700

Thanks everyone for the positive feedback.

I updated the FLIP with the proposed minimal solution:


https://cwiki.apache.org/confluence/display/FLINK/FLIP-348%3A+Make+expanding+behavior+of+virtual+metadata+columns+configurable

I also changed the title to not cause any confusion with the concept ofsystem columns.

Now the FLIP only introduces a new ConfigOption and does not specifyconstraints on the naming of metadata columns anymore.


If there are no objections, I would start a voting thread by tomorrow.

Thanks,
Timo


On 29.08.23 00:21, Alexey Leonov-Vendrovskiy wrote:

SGTM

For future reference, responding here:

5) Any operator can introduce system (psedo) columns.


This is clearly out of scope for this FLIP. The implementation effort
would be huge and could introduce a lot of bugs.


  I didn't imply any specific implementation or feature coverage in the
currently proposed FLIP, but rather as a way to describe the semantics of
system columns that those could be (but don't have to) introduced by any
operator.

Thanks,
Alexey


On Tue, Aug 22, 2023 at 2:18 PM Jing Ge <[email protected]> wrote:

Hi Timo,

Your last suggestion sounds good.

Best regards,
Jing

On Mon, Aug 21, 2023 at 4:21 AM Benchao Li <[email protected]> wrote:

It sounds good to me too, that we avoid introducing the concept of

"system

columns" for now.

Timo Walther <[email protected]> 于2023年8月18日周五 22:38写道：

Great, I also like my last suggestion as it is even more elegant. I

will

update the FLIP until Monday.

Regards,
Timo

On 17.08.23 13:55, Jark Wu wrote:

Hi Timo,

I'm fine with your latest suggestion that introducing a flag to

control

expanding behavior of metadata virtual columns, but not introducing
any concept of system/pseudo columns for now.

Best,
Jark

On Tue, 15 Aug 2023 at 23:25, Timo Walther <[email protected]>

wrote:

Hi everyone,

I would like to bump this thread up again.

Esp. I would like to hear opinions on my latest suggestions to

simply

use METADATA VIRTUAL as system columns and only introduce a config
option for the SELECT * behavior. Implementation-wise this means

minimal

effort and less new concepts.

Looking forward to any kind of feedback.

Thanks,
Timo

On 07.08.23 12:07, Timo Walther wrote:

Hi everyone,

thanks for the various feedback and lively discussion. Sorry, for

the

late reply as I was on vacation. Let me answer to some of the

topics:


1) Systems columns should not be shown with DESCRIBE statements

This sounds fine to me. I will update the FLIP in the next

iteration.


2) Do you know why most SQL systems do not need any prefix with

their

pseudo column?

Because most systems do not have external catalogs or connectors.

And

also the number of system columns is limited to a handful of

columns.

Flink is more generic and thus more complex. And we have already

the

concepts of metadata columns. We need to be careful with not

overloading

our language.

3) Implementation details

   > how to you plan to implement the "system columns", do we need

to

add

it to `RelNode` level? Or we just need to do it in the
parsing/validating phase?
   > I'm not sure that Calcite's "system column" feature is fully

ready


My plan would be to only modify the parsing/validating phase. I

would

like to avoid additional complexity in planner rules and
connector/catalog interfaces. Metadata columns already support
projection push down and are passed through the stack (via Schema,
ResolvedSchema, SupportsReadableMetadata). Calcite's "system

column"

feature is not fully ready yet and it would be a large effort
potentially introducing bugs in supporting it. Thus, I'm proposing

to

leverage what we already have. The only part that needs to be

modified

is the "expand star" method in SqlValidator and Table API.

Queries such as `SELECT * FROM (SELECT $rowtime, * FROM t);` would

show

$rowtime as the expand star has only a special case when in the

scope

for `FROM t`. All further subqueries treat it as a regular column.

4) Built-in defined pseudo-column "$rowtime"

   > Did you consider making it as a built-in defined pseudo-column
"$rowtime" which returns the time attribute value (if exists) or

null

(if non-exists) for every table/query, and pseudo-column

"$proctime"

always returns PROCTIME() value for each table/query

Built-in pseudo-columns mean that connector or catalog providers

need

consensus in Flink which pseudo-columns should be built-in. We

should

keep the concept generic and let platform providers decide which

pseudo

columns to expose. $rowtime might be obvious but others such as
$partition or $offset are tricky to get consensus as every external
connector works differently. Also a connector might want to expose
different time semantics (such as ingestion time).

5) Any operator can introduce system (psedo) columns.

This is clearly out of scope for this FLIP. The implementation

effort

would be huge and could introduce a lot of bugs.

6) "Metadata Key Prefix Constraint" which is still a little complex

Another option could be to drop the naming pattern constraint. We

could

make it configurable that METADATA VIRTUAL columns are never

selected

by

default in SELECT * or visible in DESCRIBE. This would further

simplify

the FLIP and esp lower the impact on the planner and all

interfaces.


What do you think about this? We could introduce a flag:

table.expand-metadata-columns (better name to be defined)

This way we don't need to introduce the concept of system columns

yet,

but can still offer similar functionality with minimal overhead in

the

code base.

Regards,
Timo




On 04.08.23 23:06, Alexey Leonov-Vendrovskiy wrote:

Looks like both kinds of system columns can converge.
We can say that any operator can introduce system (psedo) columns.

cc Eugene who is also interested in the subject.

On Wed, Aug 2, 2023 at 1:03 AM Paul Lam <[email protected]>

wrote:

Hi Timo,

Thanks for starting the discussion! System columns are no doubt a
good boost on Flink SQL’s usability, and I see the feedbacks are
mainly concerns about the accessibility of system columns.

I think most of the concerns could be solved by clarifying the
ownership of the system columns. Different from databases like
Oracle/BigQuery/PG who owns the data/metadata, Flink uses the
data/metadata from external systems. That means Flink could
have 2 kinds of system columns (take ROWID for example):

1. system columns provided by external systems via catalogs, such
       as ROWID from the original system.
2. system columns generated by Flink, such as ROWID generated by
       Flink itself.

IIUC, the FLIP is proposing the 1st approach: the catalog defines

what

system columns to provide, and Flink treats them as normal

columns

with a special naming pattern.

On the other hand, Jark is proposing the 2nd one: the system

columns

are defined and owned by Flink, and can be inferred from external
systems. Therefore, system columns should be predefined by Flink,
and optionally implemented by the catalogs.

Personally, I’m in favor of the 2nd approach, because it makes

the

system columns very accessible and more aligned across the

catalogs.


BTW, I second Alexey that systems columns should not be shown

with

DESCRIBE statements.

WDYT? Thanks!

Best,
Paul Lam

2023年7月31日 23:54，Jark Wu <[email protected]> 写道：

Hi Timo,

Thanks for your proposal. I think this is a nice feature for

users

and I
prefer option 3.

I only have one concern about the concept of pseudo-column or
system-column,
because this is the first time we introduce it in Flink SQL. The
confusion is similar to the
question of Benchao and Sergey about the propagation of

pseudo-column.


   From my understanding, a pseudo-column can be get from an

arbitrary

query,

just similar to
ROWNUM in Oracle[1], such as :

SELECT *
FROM (SELECT * FROM employees ORDER BY employee_id)
WHERE ROWNUM < 11;

However, IIUC, the proposed "$rowtime" pseudo-column can only be

got

from
the physical table
and can't be got from queries even if the query propagates the

rowtime

attribute. There was also
a discussion about adding a pseudo-column "_proctime" [2] to

make

lookup
join easier to use
which can be got from arbitrary queries. That "_proctime" may

conflict

with

the proposed
pseudo-column concept.

Did you consider making it as a built-in defined pseudo-column
"$rowtime"
which returns the
time attribute value (if exists) or null (if non-exists) for

every

table/query, and pseudo-column
"$proctime" always returns PROCTIME() value for each

table/query.

In

this
way, catalogs only need
to provide a default rowtime attribute and users can get it in

the

same

way. And we don't need
to introduce the contract interface of "Metadata Key Prefix

Constraint"

which is still a little complex
for users and devs to understand.

Best,
Jark

[1]:

https://docs.oracle.com/cd/E11882_01/server.112/e41084/pseudocolumns009.htm#SQLRF00255

[2]:

https://lists.apache.org/thread/7ln106qxyw8sp7ljq40hs2p1lb1gdwj5





On Fri, 28 Jul 2023 at 06:18, Alexey Leonov-Vendrovskiy <
[email protected]> wrote:


`SELECT * FROM (SELECT $rowtime, * FROM t);`
Am I right that it will show `$rowtime` in output ?



Yes, all explicitly selected columns become a part of the

result

(and

intermediate) schema, and hence propagate.

On Thu, Jul 27, 2023 at 2:40 PM Alexey Leonov-Vendrovskiy <
[email protected]> wrote:

Thank you, Timo, for starting this FLIP!

I propose the following change:

Remove the requirement that DESCRIBE need to show system

columns.



Some concrete vendor specific catalog implementations might

prefer

this
approach.
Usually the same system columns are available on all (or

family)

of

tables, and it can be easily captured in the documentation.

For example, BigQuery does exactly this: there, pseudo-columns

do

not

show

up in the table schema in any place, but can be accessed via
reference.

So I propose we:
a) Either we say that DESCRIBE doesn't show system columns,
b) Or leave this vendor-specific / or configurable via flag

(if

needed).


Regards,
Alexey

On Thu, Jul 27, 2023 at 3:27 AM Sergey Nuyanzin <

[email protected]>

wrote:

Hi Timo,

Thanks for the FLIP.
I also tend to think that Option 3 is better.

I would be also interested in a question mentioned by Benchao

Li.

And a similar question about nested queries like
`SELECT * FROM (SELECT $rowtime, * FROM t);`
Am I right that it will show `$rowtime` in output ?


On Thu, Jul 27, 2023 at 6:58 AM Benchao Li <

[email protected]

wrote:

Hi Timo,

Thanks for the FLIP, I also like the idea and option 3

sounds

good to

me.


I would like to discuss a case which is not mentioned in the
current

FLIP.

How are the "System column"s expressed in intermediate

result,

e.g.

Join?

E.g. `SELECT * FROM t1 JOIN t2`, I guess it should not

include

"system

columns" from t1 and t2 as you proposed, and for `SELECT
t1.$rowtime,

FROM t1 JOIN t2`, it should also be valid.
Then the question is how to you plan to implement the

"system

columns",

do

we need to add it to `RelNode` level? Or we just need to do

it

in the
parsing/validating phase?
I'm not sure that Calcite's "system column" feature is fully

ready

for

this

since the code about this part is imported from the earlier

project

before

it gets into Apache, and has not been considered much in the

past

development.


Jing Ge <[email protected]> 于2023年7月26日周三 00:01写
道：

Hi Timo,

Thanks for your proposal. It is a very pragmatic feature.

Among

all

options

in the FLIP, option 3 is one I prefer too and I'd like to

ask

some

questions to understand your thoughts.

1. I did some research on pseudo columns, just out of
curiosity, do

you

know why most SQL systems do not need any prefix with their

pseudo

column?

2. Some platform providers will use ${variable_name} to

define

their

own

configurations and allow them to be embedded into SQL

scripts.

Will

there

be any conflict with option 3?

Best regards,
Jing

On Tue, Jul 25, 2023 at 7:00 PM Konstantin Knauf
<[email protected]

wrote:

Hi Timo,

this makes sense to me. Option 3 seems reasonable, too.

Cheers,

Konstantin

Am Di., 25. Juli 2023 um 12:53 Uhr schrieb Timo Walther <
[email protected]

Hi everyone,

I would like to start a discussion about introducing the

concept

of

"System Columns" in SQL and Table API.

The subject sounds bigger than it actually is. Luckily,

Flink

SQL

already exposes the concept of metadata columns. And this

proposal is

just a slight adjustment for how metadata columns can be

used

as

system

columns.

The biggest problem of metadata columns currently is

that a

catalog

implementation can't provide them by default because they

would

affect

`SELECT *` when adding another one.

Looking forward to your feedback on FLIP-348:

https://cwiki.apache.org/confluence/display/FLINK/FLIP-348%3A+Support+System+Columns+in+SQL+and+Table+API


Thanks,
Timo



--
https://twitter.com/snntrable
https://github.com/knaufk



--

Best,
Benchao Li



--
Best regards,
Sergey


--

Best,
Benchao Li

Re: [DISCUSS] FLIP-348: Support System Columns in SQL and Table API

Reply via email to