Re: [DISCUSS] FLIP-520: Simplify StructuredType handling

Hao Li Thu, 24 Apr 2025 10:19:42 -0700

Hi Timo,

Thanks for the clarification. It's very helpful.


For the classpath, I suppose it could also support Python later if it's
called from the Python Table API? Do we want to indicate whether it's a Java
class or a Python class? Or should we support a list of class paths that can
cover Python, Java, and other languages later?

Thanks,
Hao

On Thu, Apr 24, 2025 at 5:34 AM Arvid Heise <[email protected]> wrote:

> Hi Timo,
>
> thank you very much for responding. I see that this is just the first step
> to get consistency between SQL and Table API and more work is to come.
>
> I still think that there is some redundancy between STRUCT and ROW but tbh
> I have more issues with ROW than with STRUCT. (What is even the meaning of
> a nested ROW?)
>
> So +1 with your proposal and maybe we can deprecate ROW at some later point
> in time.
>
> Best,
>
> Arvid
>
> On Thu, Apr 24, 2025 at 11:57 AM Timo Walther <[email protected]> wrote:
>
> > Hi Arvid, Hi Hao,
> >
> > thanks for this valuable feedback. Let me clarify a few things before I
> > go into the details.
> >
> > Just to avoid any confusion: the FLIP does not propose introducing the
> > StructuredType. Structured types backed by classes have existed in
> > Flink for years and are already supported in UDFs, Table.collect(),
> > StreamTableEnvironment.toDataStream, and connectors. Structured types
> > were introduced for a better programmatic story in Table API. They
> > avoid the need for manually defining the full schema at the edges.
> > Manual schema work is annoying, and with structured types it is possible
> > to use classes wherever a type is expected.
> >
> > The goal of this FLIP is only to bring Table API and SQL closer together.
> > In general, this is only the first step of my larger vision of
> > structured data handling. There are basically 3 kinds of structured
> > types:
> >
> > 1) a typed, fixed field struct like STRUCTURED<'Money', i INT, s STRING>
> > 2) an untyped, fixed field struct like STRUCTURED<i INT, s STRING>
> > (similar to Snowflake OBJECT(i INT, s STRING))
> > 3) an untyped struct for semi-structured data like STRUCTURED (similar
> > to Snowflake OBJECT)
> >
> > RowType represents 2), StructuredType represents 1) and a future
> > semi-structured type can represent 3) (but out of scope for this FLIP).
> >
> > If we don't support a typed struct, Money(i INT) and User(i INT) are not
> > distinct in SQL. For table.collect() or eval(Row row) in UDFs, it would
> > mean that those need the full schema declaration in order to map to a
> > target type. Structured types avoid all of that and make Table API very
> > powerful.
> >
> > Usually both the UDF and the collect()/toDataStream() are defined in the
> > same Table API program. Thus, the class is usually present in the same
> > classpath and this becomes less of an issue in production. Casting
> > structured types to ROW is also supported.
> >
> > The implementation effort of this FLIP is very low. It is mostly intended
> > to fill gaps rather than overhaul the type system, also to avoid any
> > backwards compatibility issues.
> >
> > Let me know what you think.
> >
> > Cheers,
> > Timo
> >
> > On 23.04.25 21:27, Hao Li wrote:
> > > I think Arvid has a good point. Why not define an Object type without a
> > > class and, when you get it in Table API, try to cast it to some class? I
> > > found
> > > https://docs.oracle.com/javase/1.5.0/docs/guide/jdbc/getstart/mapping.html.
> > > Under the `JAVA_OBJECT` type section, they have:
> > >
> > > ```
> > >
> > > ResultSet rs = stmt.executeQuery("SELECT ENGINEERS FROM PERSONNEL");
> > > while (rs.next()) {
> > >     Engineer eng = (Engineer) rs.getObject("ENGINEERS");
> > >     System.out.println(eng.lastName + ", " + eng.firstName);
> > > }
> > >
> > > ```
> > >
> > > For us, how about adding a `getFieldAs(int pos, Class<T> clazz)` method
> > > to the Row type? Your example:
> > >
> > > ```
> > >
> > > TableEnvironment env = ...
> > >
> > > Table t = env.sqlQuery(
> > >     "SELECT OBJECT_OF('com.example.User', 'name', 'Bob', 'age', 42)");
> > >
> > > // Tries to resolve `com.example.User` in the classpath; if not present,
> > > // returns `Row`
> > > t.execute().collect();
> > > ```
> > >
> > > Will be
> > > ```
> > > TableEnvironment env = ...
> > >
> > > Table t = env.sqlQuery("SELECT OBJECT_OF('name', 'Bob', 'age', 42)");
> > >
> > > // Tries to resolve `com.example.User` in the classpath; if not present,
> > > // returns `Row`
> > > for (Row row : t.execute().collect()) {
> > >      User user = row.getFieldAs(0, User.class);
> > > }
> > > ```
> > >
> > > For Arvid's question: "However, at that point, why do we actually need
> > > anything beyond ROW?"
> > >
> > > Maybe the difference is that the Row type shouldn't support being cast
> > > to a user-defined class, but `StructuredType` can.
> > >
> > > Thanks,
> > > Hao
> > >
> > > On Wed, Apr 23, 2025 at 2:04 AM Arvid Heise <[email protected]> wrote:
> > >
> > >> Hi Timo,
> > >>
> > >> thanks for addressing my points. I'm not set on using STRUCT et al. but
> > >> wanted to point out the alternatives.
> > >>
> > >> Regarding the attached class name, I have similar confusion to Hao. I
> > >> wonder if structured types shouldn't be anonymous by default, in the
> > >> sense that initially we don't attach a class name to them. As you
> > >> pointed out, it has no real semantics in SQL and we can't validate it.
> > >> Another thing to consider is that if one user creates a table through
> > >> some means and another user wants to consume it, the second user may
> > >> not have access to the class as is. But the user could easily create a
> > >> compatible class on their own.
> > >>
> > >> Consequently, I'm thinking about getting rid of the type altogether.
> > >> Only on the edges, we can use conversion to the user types when users
> > >> actually access the ROW:
> > >> * Any Table API access that wants to collect results (in your last
> > >> example, what is t.execute().collect(); returning? How does that work
> > >> in the multi-user setup sketched above? Wouldn't it be easier if the
> > >> consumer explicitly gave us the POJO type that it expects?)
> > >> * Any DataStream conversion
> > >> * Any UDF
> > >>
> > >> However, at that point, why do we actually need anything beyond ROW?
> > >>
> > >> Best,
> > >>
> > >> Arvid
> > >>
> > >> On Wed, Apr 23, 2025 at 8:52 AM Timo Walther <[email protected]> wrote:
> > >>
> > >>> Hi Hao,
> > >>>
> > >>> 1. Can `StructuredType` be nested?
> > >>>
> > >>> Yes this is supported.
> > >>>
> > >>> 2. What's the main reason the class won't be enforced in SQL?
> > >>>
> > >>> SQL should not care about classes. Within the SQL ecosystem, the SQL
> > >>> engine controls the data serialization and protocols. The SQL engine
> > >>> will not load the class. Classes are a concept of a JVM or Python API
> > >>> endpoint. This is also the reason why a SQL ARRAY<BIGINT> can be
> > >>> represented as List<Long>, long[], or Long[]. The latter are only
> > >>> concepts in the target programming language and might look different
> > >>> in Python.
> > >>>
> > >>> Regards,
> > >>> Timo
> > >>>
> > >>>
> > >>> On 22.04.25 23:54, Hao Li wrote:
> > >>>> Hi Timo,
> > >>>>
> > >>>> Thanks for the FLIP. +1 with a few questions:
> > >>>>
> > >>>> 1. Can `StructuredType` be nested? e.g.
> > >>>> `STRUCTURED<'com.example.User', name STRING, age INT NOT NULL,
> > >>>> address STRUCTURED<'com.example.address', street STRING, zip STRING>>`
> > >>>>
> > >>>> 2. What's the main reason the class won't be enforced in SQL? Since
> > >>>> tables created in SQL can also be used in Table API, will it come as
> > >>>> a surprise if it's working in SQL and then failing in Table API? What
> > >>>> if `com.example.User` was not validated in SQL when creating the
> > >>>> table, then the class was created for something else with different
> > >>>> fields, and then in Table API it's not compatible?
> > >>>>
> > >>>> Hao
> > >>>>
> > >>>> On Tue, Apr 22, 2025 at 9:39 AM Timo Walther <[email protected]> wrote:
> > >>>>
> > >>>>> Hi Arvid, Hi Sergey,
> > >>>>>
> > >>>>> thanks for your feedback. I updated the FLIP accordingly, but let me
> > >>>>> answer your questions here as well:
> > >>>>>
> > >>>>>    > Are we going to enforce that the name is a valid class name?
> > >>>>>    > What is happening if it's not a correct name?
> > >>>>>    > What are the implications of using a class that is not in the
> > >>>>>    > classpath in Table API? It looks to me that the name is
> > >>>>>    > metadata-only until we try to access the objects directly in
> > >>>>>    > Table/DataStream API.
> > >>>>>
> > >>>>> Names are not enforced or validated. They are pure metadata as
> > >>>>> mentioned in Section 2.1. We fall back to Row as the conversion class
> > >>>>> if the name cannot be resolved in the current classpath. So when
> > >>>>> staying in the SQL ecosystem (i.e. not switching to Table API,
> > >>>>> DataStream API, or UDFs), the class does not need to be present.
> > >>>>>
> > >>>>>    > Should Expressions.objectOf(String, Object... kv); also have an
> > >>>>>    > overload where you can put in the StructuredType in case where
> > >>>>>    > the class is not in the CP?
> > >>>>>
> > >>>>> That makes a lot of sense. I added a DataTypes.STRUCTURED(String,
> > >>>>> Field...) method and an Expressions.objectOf(String, Object...).
> > >>>>>
> > >>>>>    > What is the expected outcome of supplying fewer keys than
> > >>>>>    > defined in the structured type? Are we going to make use of
> > >>>>>    > nullability here?
> > >>>>>    > If so, *_INSERT and *_REMOVE may have some use.
> > >>>>>
> > >>>>> Currently, we go with the most conservative approach, which means
> > >>>>> that all keys need to be present. Maybe we can reserve this feature
> > >>>>> for future work and make the logic more lenient.
> > >>>>>
> > >>>>>    > Talking about nullability: Is there some option to make the
> > >>>>>    > declared fields NOT NULL? If so, could you amend one example to
> > >>>>>    > show that?
> > >>>>>    > (Grammar? implies that it's not possible)
> > >>>>>
> > >>>>> NOT NULL is supported similar to ROW<i INT NOT NULL>. I adjusted one
> > >>>>> of the examples.
> > >>>>>
> > >>>>>    > One bigger concern is around the naming. For me, OBJECT is used
> > >>>>>    > for semi-structured types that are open. Your FLIP implies a
> > >>>>>    > closed design and that you want to add an open OBJECT later. I
> > >>>>>    > asked ChatGPT about other DB implementations and it seems like
> > >>>>>    > STRUCT is used more often (see below). So, I'd propose to call it
> > >>>>>    > STRUCT<...>, STRUCT_OF, structOf, UPDATE_STRUCT, and updateStruct
> > >>>>>    > respectively.
> > >>>>>
> > >>>>> Naming is hard. I was also torn between STRUCT, STRUCTURED, or
> > >>>>> OBJECT. In Flink, the ROW type is rather our STRUCT type, because it
> > >>>>> works fully position-based. Structured types might be name-based in
> > >>>>> the future for better schema evolution, so they rather model an
> > >>>>> OBJECT type. This was my reason for choosing OBJECT_OF (typed to
> > >>>>> class name and fixed fields) vs. OBJECT (semi-structured without
> > >>>>> fixed fields). Snowflake also uses OBJECT(i INT) (for structured
> > >>>>> types) and OBJECT (for semi-structured types).
> > >>>>>
> > >>>>> Also, both structured and semi-structured types can then share
> > >>>>> functions such as UPDATE_OBJECT().
> > >>>>>
> > >>>>> What do others think?
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Timo
> > >>>>>
> > >>>>> On 22.04.25 12:08, Sergey Nuyanzin wrote:
> > >>>>>> Thanks for driving this Timo
> > >>>>>>
> > >>>>>> The FLIP seems reasonable to me
> > >>>>>>
> > >>>>>> I have one minor question/clarification:
> > >>>>>> do I understand correctly that after this FLIP we can execute
> > >>>>>> `typeof` against the result of `OBJECT_OF`?
> > >>>>>> For instance,
> > >>>>>> SELECT typeof(OBJECT_OF(
> > >>>>>>      'com.example.User',
> > >>>>>>      'name', 'Bob',
> > >>>>>>      'age', 42
> > >>>>>> ));
> > >>>>>>
> > >>>>>> should return `STRUCTURED<'com.example.User', name STRING, age INT>`?
> > >>>>>>
> > >>>>>> On Tue, Apr 22, 2025 at 10:57 AM Timo Walther <[email protected]> wrote:
> > >>>>>>>
> > >>>>>>> Hi everyone,
> > >>>>>>>
> > >>>>>>> I would like to ask again for feedback on this FLIP. It is a rather
> > >>>>>>> small change but with big impact on usability for structured data.
> > >>>>>>>
> > >>>>>>> Are there any objections? Otherwise I would like to continue with
> > >>>>>>> voting soon.
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>> Timo
> > >>>>>>>
> > >>>>>>> On 10.04.25 07:54, Timo Walther wrote:
> > >>>>>>>> Hi everyone,
> > >>>>>>>>
> > >>>>>>>> I would like to start a discussion about FLIP-520: Simplify
> > >>>>>>>> StructuredType handling [1].
> > >>>>>>>>
> > >>>>>>>> Flink SQL already supports structured types in the engine,
> > >>>>>>>> serializers, UDFs, and connector interfaces. However, currently only
> > >>>>>>>> Table API was able to make use of them. While UDFs can take objects
> > >>>>>>>> as input and return types, it is actually quite inconvenient to use
> > >>>>>>>> them in transformations.
> > >>>>>>>>
> > >>>>>>>> This FLIP fixes some immediate blockers in the use of structured
> > >>>>>>>> types.
> > >>>>>>>>
> > >>>>>>>> Looking forward to feedback.
> > >>>>>>>>
> > >>>>>>>> Cheers,
> > >>>>>>>> Timo
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-520%3A+Simplify+StructuredType+handling
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>>
> > >>
> > >
> >
> >
>
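
The fallback rule described in the thread (the class name is pure metadata; if it cannot be resolved in the current classpath, Row is used as the conversion class) could be sketched roughly as follows. This is plain Java and not Flink code: `ConversionClassFallback`, `GenericRow`, and `resolveConversionClass` are hypothetical names introduced only for illustration, and `GenericRow` merely stands in for Flink's `Row`. How Flink actually performs the lookup is not shown in the thread, so treat this as an assumption-laden sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; none of these names are Flink APIs.
final class ConversionClassFallback {

    /** Hypothetical stand-in for Flink's Row: a generic container of named fields. */
    static final class GenericRow {
        final Map<String, Object> fields = new LinkedHashMap<>();
    }

    /**
     * Treats the structured type's name as metadata only: if a class with that
     * name can be loaded, use it as the conversion class; otherwise fall back
     * to the generic row container.
     */
    static Class<?> resolveConversionClass(String className) {
        ClassLoader loader = ConversionClassFallback.class.getClassLoader();
        try {
            // 'false' = do not run static initializers just to check presence.
            return Class.forName(className, false, loader);
        } catch (ClassNotFoundException | LinkageError e) {
            // Name is not validated anywhere, so an unknown name is not an error.
            return GenericRow.class;
        }
    }

    public static void main(String[] args) {
        // A class that is always present resolves to itself.
        System.out.println(resolveConversionClass("java.lang.String").getName());
        // The thread's example name; falls back unless such a class is on the classpath.
        System.out.println(resolveConversionClass("com.example.User").getName());
    }
}
```

With this shape, a pure SQL pipeline never needs the class: only the edge that converts to Java objects performs the lookup, and an unresolved name degrades to the generic row rather than failing.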
