Re: [DISCUSS] FLIP-520: Simplify StructuredType handling

Timo Walther Tue, 29 Apr 2025 02:47:09 -0700

Hi Hao,

thanks for your feedback. Currently, the Python API does not support structured types at all. So this might require another FLIP in the future with a dedicated effort. However, I have considered Python already in the design. Also Python classes are discoverable from a fully qualified class path:


>>> import importlib
>>> module = importlib.import_module('pyflink.table')
>>> MyClass = getattr(module, 'TableEnvironment')
>>> MyClass
<class 'pyflink.table.table_environment.TableEnvironment'>
>>> MyClass()

So a class string can be used for both Java and Python. And we should also encourage users to use the same package/name across languages. In general I don't expect that users mix Java and Python. If they do, we can also offer a package resolution proxy API in the future.
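
As a rough illustration of the lookup above driven by a single class string, here is a minimal sketch. The helper name `resolve_class` is hypothetical and not part of PyFlink; a standard-library class is used so the snippet runs without Flink installed.

```python
import importlib


def resolve_class(qualified_name: str) -> type:
    """Resolve 'package.module.ClassName' to the class object it names."""
    module_name, _, class_name = qualified_name.rpartition(".")
    if not module_name:
        raise ValueError(f"not a fully qualified name: {qualified_name!r}")
    module = importlib.import_module(module_name)
    resolved = getattr(module, class_name)
    if not isinstance(resolved, type):
        raise TypeError(f"{qualified_name!r} is not a class")
    return resolved


# Standard-library class used here; any importable class path works the same way.
ordered_dict_cls = resolve_class("collections.OrderedDict")
print(ordered_dict_cls)
```

Splitting on the last dot assumes the class lives directly in the named module, which is the simplest convention if the same package/name is meant to be shared across languages.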


I hope this answers your question.

Thanks,
Timo


On 24.04.25 19:18, Hao Li wrote:

Hi Timo,

Thanks for the clarification. It's very helpful.

For the classpath, I suppose it can also support Python later if it's
called in the Python Table API? Do we want to indicate whether it's a Java
classpath or a Python class? Or should we support a list of classpaths,
which could include Python, Java, or other languages later?

Thanks,
Hao

On Thu, Apr 24, 2025 at 5:34 AM Arvid Heise <ar...@apache.org> wrote:

Hi Timo,

thank you very much for responding. I see that this is just the first step
to get consistency between SQL and Table API and more work is to come.

I still think that there is some redundancy between STRUCT and ROW but tbh
I have more issues with ROW than with STRUCT. (What is even the meaning of
a nested ROW?)

So +1 with your proposal and maybe we can deprecate ROW at some later point
in time.

Best,

Arvid

On Thu, Apr 24, 2025 at 11:57 AM Timo Walther <twal...@apache.org> wrote:

Hi Arvid, Hi Hao,

thanks for this valuable feedback. Let me clarify a few things before I
go into the details.

Just to avoid any confusion: the FLIP does not propose introducing the
StructuredType. Structured types backed by classes have existed in
Flink for years and are already supported in UDFs, Table.collect(),
StreamTableEnvironment.toDataStream, and connectors. Structured types
have been introduced for a better programmatic story in Table API. They
avoid the need for manually defining the full schema at the edges.
Manual schema work is annoying and with structured types it is possible
to use classes wherever a type is expected.

The goal of this FLIP is only to bring Table API and SQL closer together.
In general, this is only the first step of my larger vision of
structured data handling. There are basically 3 kinds of structured
types:


1) a typed, fixed field struct like STRUCTURED<'Money', i INT, s STRING>
2) an untyped, fixed field struct like STRUCTURED<i INT, s STRING>
(similar to Snowflake OBJECT(i INT, s STRING))
3) an untyped struct for semi-structured data like STRUCTURED (similar
to Snowflake OBJECT)

RowType represents 2), StructuredType represents 1) and a future
semi-structured type can represent 3) (but out of scope for this FLIP).

If we don't support a typed struct, Money(i INT) and User(i INT) are not
distinct in SQL. For table.collect() or eval(Row row) in UDFs, it would
mean that those need the full schema declaration in order to map to a
target type. Structured types avoid all of that and make Table API very
powerful.
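
To make the distinction between kinds 1) and 2) concrete, here is a toy model in Python. It is only a sketch: `StructType` and its fields are invented for illustration and do not mirror Flink's `StructuredType` or `RowType` classes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Field = Tuple[str, str]  # (field name, SQL type name)


@dataclass(frozen=True)
class StructType:
    """Toy model: class_name is None for an untyped, fixed-field struct."""
    fields: Tuple[Field, ...]
    class_name: Optional[str] = None


money = StructType((("i", "INT"),), class_name="Money")
user = StructType((("i", "INT"),), class_name="User")
row_a = StructType((("i", "INT"),))
row_b = StructType((("i", "INT"),))

# Same fields, different class names: the typed structs stay distinct.
print(money == user)   # False
# Untyped structs are identified by their fields alone.
print(row_a == row_b)  # True
```

Without the class name as part of the type identity, `money` and `user` would collapse into the same type, which is the situation described for Money(i INT) and User(i INT).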

Usually both the UDF and the collect()/toDataStream() are defined in the
same Table API program. Thus, the class is usually present in the same
classpath and this becomes less of an issue in production. Casting
structured types to ROW is also supported.

The implementation effort of this FLIP is very low. It's mostly intended
to fill gaps rather than overhaul the type system, which also avoids
backwards compatibility issues.

Let me know what you think.

Cheers,
Timo

On 23.04.25 21:27, Hao Li wrote:

I think Arvid has a good point. Why not define an Object type without a class
and, when you get it in Table API, try to cast it to some class? I found
https://docs.oracle.com/javase/1.5.0/docs/guide/jdbc/getstart/mapping.html

Under the `JAVA_OBJECT` type section, they have:

```
ResultSet rs = stmt.executeQuery("SELECT ENGINEERS FROM PERSONNEL");
while (rs.next()) {
    Engineer eng = (Engineer) rs.getObject("ENGINEERS");
    System.out.println(eng.lastName + ", " + eng.firstName);
}
```

For us, how about adding a `getFieldAs(int pos, Class<T> clazz)` method to the
Row type? Your example:

```
TableEnvironment env = ...

Table t = env.sqlQuery(
    "SELECT OBJECT_OF('com.example.User', 'name', 'Bob', 'age', 42)");

// Tries to resolve `com.example.User` in the classpath; if not present,
// returns `Row`
t.execute().collect();
```

would become:
```
TableEnvironment env = ...

Table t = env.sqlQuery("SELECT OBJECT_OF('name', 'Bob', 'age', 42)");

// No class name in SQL; the caller names the target class when reading
// the field (getFieldAs(int, Class) is the method proposed above)
t.execute().collect().forEachRemaining(row -> {
    User user = row.getFieldAs(0, User.class);
});
```
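
The idea of converting only at the edge, with the consumer naming the target class, can be sketched in Python as follows. This is an illustration under assumptions: `convert_as` and `User` are hypothetical, and a plain dict stands in for a row.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping, Type, TypeVar

T = TypeVar("T")


@dataclass
class User:
    name: str
    age: int


def convert_as(record: Mapping[str, Any], target: Type[T]) -> T:
    """Build `target` from a row-like mapping, matching fields by name."""
    expected = [f.name for f in fields(target)]
    missing = [name for name in expected if name not in record]
    if missing:
        raise ValueError(
            f"record lacks fields {missing} required by {target.__name__}"
        )
    return target(**{name: record[name] for name in expected})


row = {"name": "Bob", "age": 42}
user = convert_as(row, User)
print(user)  # User(name='Bob', age=42)
```

With this style, a second consumer who lacks the original class can supply any compatible class of their own, at the cost of repeating the target type at every access point.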

For Arvid's question: "However, at that point, why do we actually need
anything beyond ROW?"

Maybe the difference is that a Row type shouldn't support being cast to a
user-defined class, but `StructuredType` can.

Thanks,
Hao

On Wed, Apr 23, 2025 at 2:04 AM Arvid Heise <ahe...@confluent.io.invalid> wrote:

Hi Timo,

thanks for addressing my points. I'm not set on using STRUCT et al. but
wanted to point out the alternatives.

Regarding the attached class name, I have similar confusion to Hao. I
wonder if structured types shouldn't be anonymous by default, in the sense
that initially we don't attach a class name to them. As you pointed out, it
has no real semantics in SQL and we can't validate it.
Another thing to consider is that if one user creates a table through some
means and another user wants to consume it, the second user may not have
access to the class as is. But that user could easily create a compatible
class on their own.

Consequently, I'm thinking about getting rid of the type altogether. Only at
the edges would we use conversion to the user types, when users actually
access the ROW:
* Any Table API access that wants to collect results (in your last example,
what is t.execute().collect(); returning? How does that work in the
multi-user setup sketched above? Wouldn't it be easier if the consumer
explicitly gave us the POJO type that it expects?)
* Any DataStream conversion
* Any UDF

However, at that point, why do we actually need anything beyond ROW?

Best,

Arvid

On Wed, Apr 23, 2025 at 8:52 AM Timo Walther <twal...@apache.org> wrote:

Hi Hao,

1. Can `StructuredType` be nested?

Yes this is supported.

2. What's the main reason the class won't be enforced in SQL?

SQL should not care about classes. Within the SQL ecosystem, the SQL
engine controls the data serialization and protocols. The SQL engine
will not load the class. Classes are a concept of a JVM or Python API
endpoint. This is also the reason why a SQL ARRAY<BIGINT> can be
represented as List<Long>, long[], Long[]. The latter are only concepts
in the target programming language and might look different in Python.
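
As a small illustration of that last point, the same logical ARRAY<BIGINT> value could live in several Python containers. The choice of containers below is only an example and says nothing about what a future Python API would actually accept.

```python
from array import array

# Three Python containers holding the same logical ARRAY<BIGINT> value.
as_list = [1, 2, 3]
as_tuple = (1, 2, 3)
as_packed = array("q", [1, 2, 3])  # "q" = signed 64-bit integers


def to_logical(values):
    """Normalize any of the containers to a plain list of ints."""
    return [int(v) for v in values]


print(to_logical(as_list) == to_logical(as_tuple) == to_logical(as_packed))  # True
```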


Regards,
Timo


On 22.04.25 23:54, Hao Li wrote:

Hi Timo,

Thanks for the FLIP. +1 with a few questions:

1. Can `StructuredType` be nested? e.g.
`STRUCTURED<'com.example.User', name STRING, age INT NOT NULL, address
STRUCTURED<'com.example.address', street STRING, zip STRING>>`

2. What's the main reason the class won't be enforced in SQL? Since tables
created in SQL can also be used in Table API, will it come as a surprise if
it's working in SQL and then failing in Table API? What if
`com.example.User` was not validated in SQL when creating the table, then the
class was created for something else with different fields, and then in
Table API it's not compatible?

Hao

On Tue, Apr 22, 2025 at 9:39 AM Timo Walther <twal...@apache.org> wrote:

Hi Arvid, Hi Sergey,

thanks for your feedback. I updated the FLIP accordingly but let me
answer your questions here as well:

    > Are we going to enforce that the name is a valid class name? What is
    > happening if it's not a correct name?
    > What are the implications of using a class that is not in the
    > classpath in Table API? It looks to me that the name is metadata-only
    > until we try to access the objects directly in Table/DataStream API.


Names are not enforced or validated. They are pure metadata as mentioned
in Section 2.1. We fall back to Row as the conversion class if the name
cannot be resolved in the current classpath. So when staying in the SQL
ecosystem (i.e. not switching to Table API, DataStream API, or UDFs),
the class does not need to be present.
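
The fallback behaviour described here can be sketched as follows. Everything in the snippet is a stand-in: `Row`, `User`, `KNOWN_CLASSES`, and `to_external` are hypothetical names used to illustrate "use the class if it resolves, otherwise a generic row", not Flink code.

```python
from dataclasses import dataclass


class Row(dict):
    """Stand-in for a generic row: field values keyed by field name."""


@dataclass
class User:
    name: str
    age: int


# Hypothetical "classpath": class names this particular program can resolve.
KNOWN_CLASSES = {"com.example.User": User}


def to_external(class_name: str, values: dict):
    """Convert to the named class if it resolves, else fall back to Row."""
    cls = KNOWN_CLASSES.get(class_name)
    return cls(**values) if cls is not None else Row(values)


print(to_external("com.example.User", {"name": "Bob", "age": 42}))
print(type(to_external("com.example.Unknown", {"name": "Bob", "age": 42})).__name__)  # Row
```

Because the name is only consulted at conversion time, a program that never leaves SQL never triggers the lookup at all.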

    > Should Expressions.objectOf(String, Object... kv); also have an
    > overload where you can put in the StructuredType in case where
    > the class is not in the CP?

That makes a lot of sense. I added a DataTypes.STRUCTURED(String,
Field...) method and an Expressions.objectOf(String, Object...).

    > What is the expected outcome of supplying fewer keys than defined
    > in the structured type? Are we going to make use of nullability here?
    > If so, *_INSERT and *_REMOVE may have some use.

Currently, we go with the most conservative approach, which means that
all keys need to be present. Maybe we can leave this feature for future
work and make the logic more lenient.
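
A minimal sketch of that conservative rule, written as a standalone Python function. The name `object_of` mirrors the SQL function only loosely, and the `declared_fields` parameter is an assumption made for the example; it is not the signature of Flink's OBJECT_OF or `Expressions.objectOf`.

```python
def object_of(class_name, declared_fields, *key_values):
    """Pair up key/value arguments and require exactly the declared keys."""
    if len(key_values) % 2 != 0:
        raise ValueError("expected alternating key/value arguments")
    supplied = dict(zip(key_values[0::2], key_values[1::2]))
    missing = [f for f in declared_fields if f not in supplied]
    unknown = [k for k in supplied if k not in declared_fields]
    if missing or unknown:
        raise ValueError(f"{class_name}: missing={missing}, unknown={unknown}")
    # Return values in declared field order.
    return {f: supplied[f] for f in declared_fields}


user = object_of("com.example.User", ["name", "age"], "name", "Bob", "age", 42)
print(user)  # {'name': 'Bob', 'age': 42}
```

A more lenient variant could fill missing nullable fields with `None` instead of raising, which is roughly the future work mentioned above.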

    > Talking about nullability: Is there some option to make the declared
    > fields NOT NULL? If so, could you amend one example to show that?
    > (Grammar? implies that it's not possible)

NOT NULL is supported similar to ROW<i INT NOT NULL>. I adjusted one of
the examples.

    > One bigger concern is around the naming. For me, OBJECT is used for
    > semi-structured types that are open. Your FLIP implies a closed design
    > and that you want to add an open OBJECT later. I asked ChatGPT about
    > other DB implementations and it seems like STRUCT is used more often
    > (see below). So, I'd propose to call it STRUCT<...>, STRUCT_OF,
    > structOf, UPDATE_STRUCT, and updateStruct respectively.

Naming is hard. I was also torn between STRUCT, STRUCTURED, or OBJECT.
In Flink, the ROW type is rather our STRUCT type, because it works fully
position based. Structured types might be name-based in the future for
better schema evolution, so they rather model an OBJECT type. This was
my reason for choosing OBJECT_OF (typed to class name and fixed fields)
vs. OBJECT (semi-structured without fixed fields). Snowflake also uses
OBJECT(i INT) (for structured types) and OBJECT (for semi-structured
types).

Also, both structured and semi-structured types can then share functions
such as UPDATE_OBJECT().

What do others think?

Thanks,
Timo

On 22.04.25 12:08, Sergey Nuyanzin wrote:

Thanks for driving this Timo

The FLIP seems reasonable to me

I have one minor question/clarification
do I understand it correctly that after this FLIP we can execute
`typeof` against the result of `OBJECT_OF`?
For instance,
SELECT typeof(OBJECT_OF(
      'com.example.User',
      'name', 'Bob',
      'age', 42
));

should return `STRUCTURED<'com.example.User', name STRING, age INT>`?
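
For readers unfamiliar with the type-string notation used in this thread, here is a small Python sketch that assembles such strings from a class name and a field list. The helper `structured_type_string` is hypothetical and says nothing about what `typeof` actually returns in Flink; it only reproduces the notation quoted in the messages.

```python
def structured_type_string(class_name, fields):
    """Render STRUCTURED<'class', name TYPE, ...> from (name, type) pairs."""
    rendered = ", ".join(f"{name} {sql_type}" for name, sql_type in fields)
    return f"STRUCTURED<'{class_name}', {rendered}>"


simple = structured_type_string(
    "com.example.User", [("name", "STRING"), ("age", "INT")]
)
print(simple)

# Nesting works by passing a rendered type string as a field type.
address = structured_type_string(
    "com.example.address", [("street", "STRING"), ("zip", "STRING")]
)
nested = structured_type_string(
    "com.example.User",
    [("name", "STRING"), ("age", "INT NOT NULL"), ("address", address)],
)
print(nested)
```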

On Tue, Apr 22, 2025 at 10:57 AM Timo Walther <twal...@apache.org> wrote:


Hi everyone,

I would like to ask again for feedback on this FLIP. It is a rather
small change but with big impact on usability for structured data.

Are there any objections? Otherwise I would like to continue with voting
soon.

Thanks,
Timo

On 10.04.25 07:54, Timo Walther wrote:

Hi everyone,

I would like to start a discussion about FLIP-520: Simplify
StructuredType handling [1].

Flink SQL already supports structured types in the engine, serializers,
UDFs, and connector interfaces. However, currently only Table API is
able to make use of them. While UDFs can take objects as input and
return types, it is actually quite inconvenient to use them in
transformations.

This FLIP fixes some immediate blockers in the use of structured types.


Looking forward to feedback.

Cheers,
Timo


[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-520%3A+Simplify+StructuredType+handling
