Hi Timo,

thank you very much for responding. I see that this is just the first step
to get consistency between SQL and Table API and more work is to come.

I still think that there is some redundancy between STRUCT and ROW but tbh
I have more issues with ROW than with STRUCT. (What is even the meaning of
a nested ROW?)

So +1 with your proposal and maybe we can deprecate ROW at some later point
in time.

Best,

Arvid

On Thu, Apr 24, 2025 at 11:57 AM Timo Walther <twal...@apache.org> wrote:

> Hi Arvid, Hi Hao,
>
> thanks for this valuable feedback. Let me clarify a few things before I
> go into the details.
>
> Just to avoid any confusion: the FLIP does not propose introducing the
> StructuredType. Structured types backed by classes already exist in
> Flink for years and are already supported in UDFs, Table.collect(),
> StreamTableEnvironment.toDataStream, and connectors. Structured types
> have been introduced for a better programmatic story in Table API. They
> avoid the need for manually defining the full schema at the edges.
> Manual schema work is annoying and with structured types it is possible
> to use classes wherever a type is expected.
>
> The goal of this FLIP is only to bring Table API and SQL closer together.
> In general, this is only the first step of my larger vision of
> structured data handling. There are basically 3 kinds of structured types:
>
> 1) a typed, fixed field struct like STRUCTURED<'Money', i INT, s STRING>
> 2) an untyped, fixed field struct like STRUCTURED<i INT, s STRING>
> (similar to Snowflake OBJECT(i INT, s STRING))
> 3) an untyped struct for semi-structured data like STRUCTURED (similar
> to Snowflake OBJECT)
>
> RowType represents 2), StructuredType represents 1) and a future
> semi-structured type can represent 3) (but out of scope for this FLIP).
>
> If we don't support a typed struct, Money(i INT) and User(i INT) are not
> distinct in SQL. For table.collect() or eval(Row row) in UDFs, it would
> mean that those need the full schema declaration in order to map to a
> target type. Structured types avoid all of that and make Table API very
> powerful.
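A minimal sketch of the distinction between kinds 1) and 2), assuming a hypothetical toy `Struct` class that is not Flink's `StructuredType` or `RowType`: with a name attached, `Money(i INT)` and `User(i INT)` stay distinct; once the name is dropped (comparable to casting to ROW), they collapse into the same type.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class StructKinds {

    /** Hypothetical toy model of a struct type; not Flink's LogicalType hierarchy. */
    static final class Struct {
        final String className;    // null for an untyped struct (kind 2)
        final List<String> fields; // field declarations such as "i INT"

        Struct(String className, String... fields) {
            this.className = className;
            this.fields = Arrays.asList(fields);
        }

        /** Dropping the name models a cast from a typed struct to a plain row. */
        Struct asUntyped() {
            return new Struct(null, fields.toArray(new String[0]));
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Struct)) {
                return false;
            }
            Struct other = (Struct) o;
            return Objects.equals(className, other.className) && fields.equals(other.fields);
        }

        @Override
        public int hashCode() {
            return Objects.hash(className, fields);
        }
    }

    public static void main(String[] args) {
        Struct money = new Struct("Money", "i INT");
        Struct user = new Struct("User", "i INT");

        // With a name attached, the two types stay distinct: prints false
        System.out.println(money.equals(user));
        // Without it, they are the same row-like type: prints true
        System.out.println(money.asUntyped().equals(user.asUntyped()));
    }
}
```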
>
> Usually both the UDF and the collect()/toDataStream() are defined in the
> same Table API program. Thus, the class is usually present in the same
> classpath and this becomes less of an issue in production. Casting
> structured types to ROW is also supported.
>
> The implementation effort of this FLIP is very low. It's mostly intended
> to fill gaps rather than overhaul the type system, which also
> avoids any backwards compatibility issues.
>
> Let me know what you think.
>
> Cheers,
> Timo
>
> On 23.04.25 21:27, Hao Li wrote:
> > I think Arvid has a good point. Why not define Object type without class
> > and when you get it in table api, try to cast it to some class? I found
> > this in
> > https://docs.oracle.com/javase/1.5.0/docs/guide/jdbc/getstart/mapping.html
> > under the `JAVA_OBJECT` type section. They have:
> >
> > ```
> >
> > ResultSet rs = stmt.executeQuery("SELECT ENGINEERS FROM PERSONNEL");
> > while (rs.next()) {
> > Engineer eng = (Engineer)rs.getObject("ENGINEERS");
> > System.out.println(eng.lastName + ", " + eng.firstName);
> > }
> >
> > ```
> >
> > For us, how about adding a `getFieldAs(int pos, Class<T> clazz)` method to
> > the Row type? Your example:
> >
> > ```
> >
> > TableEnvironment env = ...
> >
> > Table t = env.sqlQuery(
> >     "SELECT OBJECT_OF('com.example.User', 'name', 'Bob', 'age', 42)");
> >
> > // Tries to resolve `com.example.User` in the classpath, if not present
> > // returns `Row`
> > t.execute().collect();
> > ```
> >
> > Will be
> > ```
> > TableEnvironment env = ...
> >
> > Table t = env.sqlQuery("SELECT OBJECT_OF('name', 'Bob', 'age', 42)");
> >
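> > // Tries to resolve `com.example.User` in the classpath, if not present
> > // returns `Row`
> > // collect() returns an iterator, so iterate with forEachRemaining
> > t.execute().collect().forEachRemaining(row -> {
> >      User user = row.getFieldAs(0, User.class); // proposed method
> > });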
> > ```
> >
> > For Arvid's question: "However, at that point, why do we actually need
> > anything beyond ROW?"
> >
> > Maybe the difference is that the Row type shouldn't support being cast to
> > a user-defined class, while `StructuredType` can be.
> >
> > Thanks,
> > Hao
> >
> > On Wed, Apr 23, 2025 at 2:04 AM Arvid Heise <ahe...@confluent.io.invalid>
> > wrote:
> >
> >> Hi Timo,
> >>
> >> thanks for addressing my points. I'm not set on using STRUCT et al. but
> >> wanted to point out the alternatives.
> >>
> >> Regarding the attached class name, I have similar confusion to Hao. I
> >> wonder if structured types shouldn't be anonymous by default in the sense
> >> that initially we don't attach a class name to them. As you pointed out, it
> >> has no real semantics in SQL and we can't validate it.
> >> Another thing to consider is that if one user creates a table through some
> >> means and another user wants to consume it, the second user may not have
> >> access to the class as is. But the user could easily create a compatible
> >> class on their own.
> >>
> >> Consequently, I'm thinking about getting rid of the type altogether. Only on
> >> the edges, we can use conversion to the user types when users actually
> >> access the ROW:
> >> * Any table API access that wants to collect results (in your last example,
> >> what is t.execute().collect(); returning? How does that work in the
> >> multi-user setup sketched above? Wouldn't it be easier if the consumer
> >> explicitly gave us the POJO type that it expects?)
> >> * Any DataStream conversion
> >> * Any UDF
> >>
> >> However, at that point, why do we actually need anything beyond ROW?
> >>
> >> Best,
> >>
> >> Arvid
> >>
> >> On Wed, Apr 23, 2025 at 8:52 AM Timo Walther <twal...@apache.org> wrote:
> >>
> >>> Hi Hao,
> >>>
> >>> 1. Can `StructuredType` be nested?
> >>>
> >>> Yes this is supported.
> >>>
> >>> 2. What's the main reason the class won't be enforced in SQL?
> >>>
> >>> SQL should not care about classes. Within the SQL ecosystem, the SQL
> >>> engine controls the data serialization and protocols. The SQL engine
> >>> will not load the class. Classes are a concept of a JVM or Python API
> >>> endpoint. This is also the reason why a SQL ARRAY<BIGINT> can be
> >>> represented as List<Long>, long[], Long[]. The latter are only concepts
> >>> in the target programming language and might look different in Python.
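As a small illustration of that point, here is a plain-Java sketch (no Flink involved; the class and method names are made up for this example) showing the same three BIGINT values held as `long[]`, `Long[]`, and `List<Long>`. The logical content is identical and only the JVM-side representation differs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ArrayConversions {

    /** Boxes a primitive array; every element keeps the same 64-bit value. */
    static Long[] toBoxedArray(long[] values) {
        return Arrays.stream(values).boxed().toArray(Long[]::new);
    }

    /** Copies the values into a list. */
    static List<Long> toList(long[] values) {
        return Arrays.stream(values).boxed().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        long[] primitive = {1L, 2L, 3L}; // one possible view of an ARRAY<BIGINT> value
        Long[] boxed = toBoxedArray(primitive);
        List<Long> list = toList(primitive);

        // All three lines print [1, 2, 3]
        System.out.println(Arrays.toString(primitive));
        System.out.println(Arrays.toString(boxed));
        System.out.println(list);
    }
}
```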
> >>>
> >>> Regards,
> >>> Timo
> >>>
> >>>
> >>> On 22.04.25 23:54, Hao Li wrote:
> >>>> Hi Timo,
> >>>>
> >>>> Thanks for the FLIP. +1 with a few questions:
> >>>>
> >>>> 1. Can `StructuredType` be nested? e.g. `STRUCTURED<'com.example.User',
> >>>> name STRING, age INT NOT NULL, address STRUCTURED<'com.example.address',
> >>>> street STRING, zip STRING>>`
> >>>>
> >>>> 2. What's the main reason the class won't be enforced in SQL? Since tables
> >>>> created in SQL can also be used in Table API, will it come as a surprise if
> >>>> it's working in SQL and then failing in Table API? What if
> >>>> `com.example.User` was not validated in SQL when creating the table, then
> >>>> the class was created for something else with different fields, and then
> >>>> in Table API it's not compatible?
> >>>>
> >>>> Hao
> >>>>
> >>>> On Tue, Apr 22, 2025 at 9:39 AM Timo Walther <twal...@apache.org> wrote:
> >>>>
> >>>>> Hi Arvid, Hi Sergey,
> >>>>>
> >>>>> thanks for your feedback. I updated the FLIP accordingly but let me
> >>>>> answer your questions
> >>>>> here as well:
> >>>>>
> >>>>>    > Are we going to enforce that the name is a valid class name? What is
> >>>>>    > happening if it's not a correct name?
> >>>>>    > What are the implications of using a class that is not in the
> >>>>>    > classpath in Table API? It looks to me that the name is metadata-only
> >>>>>    > until we try to access the objects directly in Table/DataStream API.
> >>>>>
> >>>>> Names are not enforced or validated. They are pure metadata as mentioned
> >>>>> in Section 2.1. We fall back to Row as the conversion class if the name
> >>>>> cannot be resolved in the current classpath. So when staying in the SQL
> >>>>> ecosystem (i.e. not switching to Table API, DataStream API, or UDFs),
> >>>>> the class does not need to be present.
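The fallback described here can be pictured with a short sketch. This is not Flink's actual resolution code: `resolveOrFallback` and the `GenericRow` stand-in are hypothetical names, and the only real API used is the standard `Class.forName`.

```java
public class ConversionClassResolver {

    /** Hypothetical stand-in for Flink's Row, so the sketch compiles on its own. */
    static final class GenericRow {}

    /**
     * Returns the class found under the given name, or the fallback when the
     * name does not resolve to a class on the current classpath.
     */
    static Class<?> resolveOrFallback(String className, Class<?> fallback) {
        try {
            // initialize = false: only look the class up, do not run static initializers
            return Class.forName(
                    className, false, ConversionClassResolver.class.getClassLoader());
        } catch (ClassNotFoundException | LinkageError e) {
            // The name is pure metadata here, so an unresolvable name is not an error.
            return fallback;
        }
    }

    public static void main(String[] args) {
        // A class that exists on every classpath resolves to itself.
        System.out.println(resolveOrFallback("java.lang.String", GenericRow.class));
        // A name that cannot be resolved falls back to the generic row class.
        System.out.println(resolveOrFallback("com.example.DoesNotExist", GenericRow.class));
    }
}
```

Under this reading, a pure SQL pipeline never calls such a lookup at all, which is why a missing class only matters at the Table API, DataStream API, or UDF edges.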
> >>>>>
> >>>>>    > Should Expressions.objectOf(String, Object... kv); also have an
> >>>>>    > overload where you can put in the StructuredType in case where
> >>>>>    > the class is not in the CP?
> >>>>>
> >>>>> That makes a lot of sense. I added a DataTypes.STRUCTURED(String,
> >>>>> Field...) method and an Expressions.objectOf(String, Object...).
> >>>>>
> >>>>>    > What is the expected outcome of supplying fewer keys than defined
> >>>>>    > in the structured type? Are we going to make use of nullability here?
> >>>>>    > If so, *_INSERT and *_REMOVE may have some use.
> >>>>>
> >>>>> Currently, we go with the most conservative approach, which means that
> >>>>> all keys need to be present. Maybe we can reserve this feature for future
> >>>>> work and make the logic more lenient.
> >>>>>
> >>>>>    > Talking about nullability: Is there some option to make the declared
> >>>>>    > fields NOT NULL? If so, could you amend one example to show that?
> >>>>>    > (Grammar? implies that it's not possible)
> >>>>>
> >>>>> NOT NULL is supported similar to ROW<i INT NOT NULL>. I adjusted one of
> >>>>> the examples.
> >>>>>
> >>>>>    > One bigger concern is around the naming. For me, OBJECT is used for
> >>>>>    > semi-structured types that are open. Your FLIP implies a closed design
> >>>>>    > and that you want to add an open OBJECT later. I asked ChatGPT about
> >>>>>    > other DB implementations and it seems like STRUCT is used more often
> >>>>>    > (see below). So, I'd propose to call it STRUCT<...>, STRUCT_OF,
> >>>>>    > structOf, UPDATE_STRUCT, and updateStruct respectively.
> >>>>>
> >>>>> Naming is hard. I was also torn between STRUCT, STRUCTURED, or OBJECT.
> >>>>> In Flink, the ROW type is rather our STRUCT type, because it works fully
> >>>>> position based. Structured types might be name-based in the future for
> >>>>> better schema evolution, so they rather model an OBJECT type. This was
> >>>>> my reason for choosing OBJECT_OF (typed to class name and fixed fields)
> >>>>> vs. OBJECT (semi-structured without fixed fields). Snowflake also uses
> >>>>> OBJECT(i INT) (for structured types) and OBJECT (for semi structured
> >>>>> types).
> >>>>>
> >>>>> Also, both structured and semi-structured types can then share functions
> >>>>> such as UPDATE_OBJECT().
> >>>>>
> >>>>> What do others think?
> >>>>>
> >>>>> Thanks,
> >>>>> Timo
> >>>>>
> >>>>> On 22.04.25 12:08, Sergey Nuyanzin wrote:
> >>>>>> Thanks for driving this Timo
> >>>>>>
> >>>>>> The FLIP seems reasonable to me
> >>>>>>
> >>>>>> I have one minor question/clarification
> >>>>>> Do I understand correctly that after this FLIP we can execute
> >>>>>> `typeof` against the result of `OBJECT_OF`,
> >>>>>> for instance
> >>>>>> SELECT typeof(OBJECT_OF(
> >>>>>>      'com.example.User',
> >>>>>>      'name', 'Bob',
> >>>>>>      'age', 42
> >>>>>> ));
> >>>>>>
> >>>>>> should return `STRUCTURED<'com.example.User', name STRING, age INT>`
> >>>>>> ?
> >>>>>>
> >>>>>> On Tue, Apr 22, 2025 at 10:57 AM Timo Walther <twal...@apache.org> wrote:
> >>>>>>>
> >>>>>>> Hi everyone,
> >>>>>>>
> >>>>>>> I would like to ask again for feedback on this FLIP. It is a rather
> >>>>>>> small change but with big impact on usability for structured data.
> >>>>>>>
> >>>>>>> Are there any objections? Otherwise I would like to continue with voting
> >>>>>>> soon.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Timo
> >>>>>>>
> >>>>>>> On 10.04.25 07:54, Timo Walther wrote:
> >>>>>>>> Hi everyone,
> >>>>>>>>
> >>>>>>>> I would like to start a discussion about FLIP-520: Simplify
> >>>>>>>> StructuredType handling [1].
> >>>>>>>>
> >>>>>>>> Flink SQL already supports structured types in the engine, serializers,
> >>>>>>>> UDFs, and connector interfaces. However, currently only Table API is
> >>>>>>>> able to make use of them. While UDFs can take objects as input and
> >>>>>>>> return types, it is actually quite inconvenient to use them in
> >>>>>>>> transformations.
> >>>>>>>>
> >>>>>>>> This FLIP fixes some immediate blockers in the use of structured types.
> >>>>>>>>
> >>>>>>>> Looking forward to feedback.
> >>>>>>>>
> >>>>>>>> Cheers,
> >>>>>>>> Timo
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-520%3A+Simplify+StructuredType+handling
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >
>
>