Hi Timo, thank you very much for responding. I see that this is just the first step to get consistency between SQL and Table API and more work is to come.
I still think that there is some redundancy between STRUCT and ROW, but tbh I have more issues with ROW than with STRUCT. (What is even the meaning of a nested ROW?) So +1 to your proposal, and maybe we can deprecate ROW at some later point in time.

Best,

Arvid

On Thu, Apr 24, 2025 at 11:57 AM Timo Walther <twal...@apache.org> wrote:

> Hi Arvid, Hi Hao,
>
> thanks for this valuable feedback. Let me clarify a few things before I
> go into the details.
>
> Just to avoid any confusion: the FLIP does not propose introducing the
> StructuredType. Structured types backed by classes have existed in
> Flink for years and are already supported in UDFs, Table.collect(),
> StreamTableEnvironment.toDataStream, and connectors. Structured types
> were introduced for a better programmatic story in Table API. They
> avoid the need for manually defining the full schema at the edges.
> Manual schema work is annoying, and with structured types it is possible
> to use classes wherever a type is expected.
>
> The goal of this FLIP is only to bring Table API and SQL closer together.
> In general, this is only the first step of my larger vision of
> structured data handling. There are basically 3 kinds of structured types:
>
> 1) a typed, fixed field struct like STRUCTURED<'Money', i INT, s STRING>
> 2) an untyped, fixed field struct like STRUCTURED<i INT, s STRING>
> (similar to Snowflake OBJECT(i INT, s STRING))
> 3) an untyped struct for semi-structured data like STRUCTURED (similar
> to Snowflake OBJECT)
>
> RowType represents 2), StructuredType represents 1) and a future
> semi-structured type can represent 3) (but out of scope for this FLIP).
>
> If we don't support a typed struct, Money(i INT) and User(i INT) are not
> distinct in SQL. For table.collect() or eval(Row row) in UDFs, it would
> mean that those need the full schema declaration in order to map to a
> target type. Structured types avoid all of that and make Table API very
> powerful.
>
> Usually both the UDF and the collect()/toDataStream() are defined in the
> same Table API program. Thus, the class is usually present in the same
> classpath and this becomes less of an issue in production. Casting
> structured types to ROW is also supported.
>
> The implementation effort of this FLIP is very low. It's mostly intended
> to fill gaps, not to be a major overhaul of the type system, and also to
> avoid any backwards compatibility issues.
>
> Let me know what you think.
>
> Cheers,
> Timo
>
> On 23.04.25 21:27, Hao Li wrote:
> > I think Arvid has a good point. Why not define an Object type without a class
> > and, when you get it in Table API, try to cast it to some class? I found
> > https://docs.oracle.com/javase/1.5.0/docs/guide/jdbc/getstart/mapping.html
> > which, under the `JAVA_OBJECT` type section, has:
> >
> > ```
> > ResultSet rs = stmt.executeQuery("SELECT ENGINEERS FROM PERSONNEL");
> > while (rs.next()) {
> >     Engineer eng = (Engineer) rs.getObject("ENGINEERS");
> >     System.out.println(eng.lastName + ", " + eng.firstName);
> > }
> > ```
> >
> > For us, how about adding a `getFieldAs(int pos, Class<T> clazz)` method to the
> > Row type? Your example:
> >
> > ```
> > TableEnvironment env = ...
> >
> > Table t = env.sqlQuery("SELECT OBJECT_OF('com.example.User', 'name', 'Bob', 'age', 42)");
> >
> > // Tries to resolve `com.example.User` in the classpath, if not present
> > // returns `Row`
> > t.execute().collect();
> > ```
> >
> > Will be
> > ```
> > TableEnvironment env = ...
> >
> > Table t = env.sqlQuery("SELECT OBJECT_OF('name', 'Bob', 'age', 42)");
> >
> > // Tries to resolve `com.example.User` in the classpath, if not present
> > // returns `Row`
> > for (Row row : t.execute().collect()) {
> >     User user = row.getFieldAs(0, User.class);
> > }
> > ```
> >
> > For Arvid's question: "However, at that point, why do we actually need
> > anything beyond ROW?"
> >
> > Maybe the difference is that the Row type shouldn't support being cast to a
> > user-defined class, but `StructuredType` can be.
> >
> > Thanks,
> > Hao
> >
> > On Wed, Apr 23, 2025 at 2:04 AM Arvid Heise <ahe...@confluent.io.invalid>
> > wrote:
> >
> >> Hi Timo,
> >>
> >> thanks for addressing my points. I'm not set on using STRUCT et al. but
> >> wanted to point out the alternatives.
> >>
> >> Regarding the attached class name, I have similar confusion to Hao. I
> >> wonder if structured types shouldn't be anonymous by default, in the sense
> >> that initially we don't attach a class name to them. As you pointed out, it
> >> has no real semantics in SQL and we can't validate it.
> >> Another thing to consider is that if one user creates a table through some
> >> means and another user wants to consume it, the second user may not have
> >> access to the class as is. But the user could easily create a compatible
> >> class on their own.
> >>
> >> Consequently, I'm thinking about getting rid of the type altogether. Only on
> >> the edges, we can use conversion to the user types when users actually
> >> access the ROW:
> >> * Any table API access that wants to collect results (in your last example
> >> what is t.execute().collect(); returning? How does that work in the
> >> multi-user setup sketched above? Wouldn't it be easier if the consumer
> >> explicitly gives us the POJO type that it expects?)
> >> * Any DataStream conversion
> >> * Any UDF
> >>
> >> However, at that point, why do we actually need anything beyond ROW?
> >>
> >> Best,
> >>
> >> Arvid
> >>
> >> On Wed, Apr 23, 2025 at 8:52 AM Timo Walther <twal...@apache.org> wrote:
> >>
> >>> Hi Hao,
> >>>
> >>> 1. Can `StructuredType` be nested?
> >>>
> >>> Yes, this is supported.
> >>>
> >>> 2. What's the main reason the class won't be enforced in SQL?
> >>>
> >>> SQL should not care about classes. Within the SQL ecosystem, the SQL
> >>> engine controls the data serialization and protocols.
> >>> The SQL engine will not load the class. Classes are a concept of a JVM
> >>> or Python API endpoint. This is also the reason why a SQL ARRAY<BIGINT>
> >>> can be represented as List<Long>, long[], Long[]. The latter are only
> >>> concepts in the target programming language and might look different in
> >>> Python.
> >>>
> >>> Regards,
> >>> Timo
> >>>
> >>>
> >>> On 22.04.25 23:54, Hao Li wrote:
> >>>> Hi Timo,
> >>>>
> >>>> Thanks for the FLIP. +1 with a few questions:
> >>>>
> >>>> 1. Can `StructuredType` be nested? e.g. `STRUCTURED<'com.example.User',
> >>>> name STRING, age INT NOT NULL, address STRUCTURED<'com.example.address',
> >>>> street STRING, zip STRING>>`
> >>>>
> >>>> 2. What's the main reason the class won't be enforced in SQL? Since tables
> >>>> created in SQL can also be used in Table API, will it come as a surprise if
> >>>> it's working in SQL and then failing in Table API? What if
> >>>> `com.example.User` was not validated in SQL when creating the table, then the
> >>>> class was created for something else with different fields, and then in
> >>>> Table API it's not compatible?
> >>>>
> >>>> Hao
> >>>>
> >>>> On Tue, Apr 22, 2025 at 9:39 AM Timo Walther <twal...@apache.org> wrote:
> >>>>
> >>>>> Hi Arvid, Hi Sergey,
> >>>>>
> >>>>> thanks for your feedback. I updated the FLIP accordingly but let me
> >>>>> answer your questions here as well:
> >>>>>
> >>>>> > Are we going to enforce that the name is a valid class name? What is
> >>>>> > happening if it's not a correct name?
> >>>>> > What are the implications of using a class that is not in the
> >>>>> > classpath in Table API? It looks to me that the name is metadata-only
> >>>>> > until we try to access the objects directly in Table/DataStream API.
> >>>>>
> >>>>> Names are not enforced or validated. They are pure metadata as mentioned
> >>>>> in Section 2.1.
> >>>>> We fall back to Row as the conversion class if the name
> >>>>> cannot be resolved in the current classpath. So when staying in the SQL
> >>>>> ecosystem (i.e. not switching to Table API, DataStream API, or UDFs),
> >>>>> the class need not be present.
> >>>>>
> >>>>> > Should Expressions.objectOf(String, Object... kv); also have an
> >>>>> > overload where you can put in the StructuredType in case where
> >>>>> > the class is not in the CP?
> >>>>>
> >>>>> That makes a lot of sense. I added a DataTypes.STRUCTURED(String,
> >>>>> Field...) method and an Expressions.objectOf(String, Object...).
> >>>>>
> >>>>> > What is the expected outcome of supplying fewer keys than defined
> >>>>> > in the structured type? Are we going to make use of nullability here?
> >>>>> > If so, *_INSERT and *_REMOVE may have some use.
> >>>>>
> >>>>> Currently, we go with the most conservative approach, which means that
> >>>>> all keys need to be present. Maybe we can reserve this feature for future
> >>>>> work and make the logic more lenient.
> >>>>>
> >>>>> > Talking about nullability: Is there some option to make the declared
> >>>>> > fields NOT NULL? If so, could you amend one example to show that?
> >>>>> > (Grammar? implies that it's not possible)
> >>>>>
> >>>>> NOT NULL is supported similar to ROW<i INT NOT NULL>. I adjusted one of
> >>>>> the examples.
> >>>>>
> >>>>> > One bigger concern is around the naming. For me, OBJECT is used for
> >>>>> > semi-structured types that are open. Your FLIP implies a closed design
> >>>>> > and that you want to add an open OBJECT later. I asked ChatGPT about
> >>>>> > other DB implementations and it seems like STRUCT is used more often
> >>>>> > (see below). So, I'd propose to call it STRUCT<...>, STRUCT_OF,
> >>>>> > structOf, UPDATE_STRUCT, and updateStruct respectively.
> >>>>>
> >>>>> Naming is hard. I was also torn between STRUCT, STRUCTURED, and OBJECT.
> >>>>> In Flink, the ROW type is rather our STRUCT type, because it works fully
> >>>>> position-based. Structured types might be name-based in the future for
> >>>>> better schema evolution, so they rather model an OBJECT type. This was
> >>>>> my reason for choosing OBJECT_OF (typed to class name and fixed fields)
> >>>>> vs. OBJECT (semi-structured without fixed fields). Snowflake also uses
> >>>>> OBJECT(i INT) (for structured types) and OBJECT (for semi-structured
> >>>>> types).
> >>>>>
> >>>>> Also, both structured and semi-structured types can then share functions
> >>>>> such as UPDATE_OBJECT().
> >>>>>
> >>>>> What do others think?
> >>>>>
> >>>>> Thanks,
> >>>>> Timo
> >>>>>
> >>>>> On 22.04.25 12:08, Sergey Nuyanzin wrote:
> >>>>>> Thanks for driving this Timo
> >>>>>>
> >>>>>> The FLIP seems reasonable to me
> >>>>>>
> >>>>>> I have one minor question/clarification
> >>>>>> do I understand it correctly that after this FLIP we can execute
> >>>>>> `typeof` against the result of `OBJECT_OF`
> >>>>>> for instance
> >>>>>> SELECT typeof(OBJECT_OF(
> >>>>>>   'com.example.User',
> >>>>>>   'name', 'Bob',
> >>>>>>   'age', 42
> >>>>>> ));
> >>>>>>
> >>>>>> should return `STRUCTURED<'com.example.User', name STRING, age INT>`
> >>>>>> ?
> >>>>>>
> >>>>>> On Tue, Apr 22, 2025 at 10:57 AM Timo Walther <twal...@apache.org> wrote:
> >>>>>>>
> >>>>>>> Hi everyone,
> >>>>>>>
> >>>>>>> I would like to ask again for feedback on this FLIP. It is a rather
> >>>>>>> small change but with big impact on usability for structured data.
> >>>>>>>
> >>>>>>> Are there any objections? Otherwise I would like to continue with voting
> >>>>>>> soon.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Timo
> >>>>>>>
> >>>>>>> On 10.04.25 07:54, Timo Walther wrote:
> >>>>>>>> Hi everyone,
> >>>>>>>>
> >>>>>>>> I would like to start a discussion about FLIP-520: Simplify
> >>>>>>>> StructuredType handling [1].
> >>>>>>>>
> >>>>>>>> Flink SQL already supports structured types in the engine, serializers,
> >>>>>>>> UDFs, and connector interfaces. However, currently only Table API was
> >>>>>>>> able to make use of them. While UDFs can take objects as input and
> >>>>>>>> return types, it is actually quite inconvenient to use them in
> >>>>>>>> transformations.
> >>>>>>>>
> >>>>>>>> This FLIP fixes some immediate blockers in the use of structured types.
> >>>>>>>>
> >>>>>>>> Looking forward to feedback.
> >>>>>>>>
> >>>>>>>> Cheers,
> >>>>>>>> Timo
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-520%3A+Simplify+StructuredType+handling