Thanks Yisha for bringing up this discussion. Schema inference is a very interesting and useful feature, especially for formats with well-defined schemas such as Protobuf/Parquet. I'm looking forward to the FLIP.
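For the .proto case you mention below, a purely hypothetical Flink DDL could look like the sketch here, borrowing Snowflake's USING TEMPLATE shape only for illustration (none of this syntax exists in Flink today; INFER_SCHEMA, LOCATION, and FORMAT are placeholder names the FLIP would need to pin down):

    -- Hypothetical syntax: derive the physical columns from a Protobuf
    -- descriptor file instead of listing thousands of fields by hand;
    -- connector options stay in the usual WITH clause.
    CREATE TABLE user_events
      USING TEMPLATE (
        INFER_SCHEMA(
          LOCATION => 'file:///schemas/user_events.proto',
          FORMAT => 'protobuf'
        )
      )
      WITH (
        'connector' = 'kafka',
        'topic' = 'user_events',
        'value.format' = 'protobuf'
      );

Whatever shape the final syntax takes, the key point is that the column list is derived from the external schema file at DDL time rather than maintained by hand.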
Yisha Zhou <zhouyi...@bytedance.com.invalid> wrote on Mon, Jan 15, 2024 at 16:29:
>
> Hi dev,
>
> Currently, we create tables by listing all physical columns, or by using
> the LIKE syntax to reuse a table schema from a catalog. However, in our
> company there are many cases where the messages in external systems have
> very complex schemas; in the worst case, some Protobuf data contains
> thousands of fields.
>
> In these cases, listing the fields in the DDL is very hard work, and
> creating and updating such a complex schema in a catalog is also costly.
> Therefore, I'd like to introduce the ability to detect a table schema from
> external files in DDL.
>
> A good precedent from Snowflake [1] works like below:
>
> CREATE TABLE mytable
>   USING TEMPLATE (
>     SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
>       FROM TABLE(
>         INFER_SCHEMA(
>           LOCATION=>'@mystage/json/',
>           FILE_FORMAT=>'my_json_format'
>         )
>       ));
>
> INFER_SCHEMA is a table function that 'automatically detects the file
> metadata schema in a set of staged data files that contain semi-structured
> data and retrieves the column definitions.' The files can be in Parquet,
> Avro, ORC, JSON, or CSV.
>
> We don't need to follow this syntax, but the functionality is exactly what
> I want. In addition, the file can be more than just a semi-structured data
> file; it can be a metadata file, for example a .proto file or a .thrift
> file.
>
> Since this will be a big feature, it deserves a FLIP to describe it in
> detail. I'm looking forward to your feedback and suggestions before I
> start to work on it.
>
> Best,
> Yisha
>
> [1] https://docs.snowflake.com/en/sql-reference/functions/infer_schema

--
Best,
Benchao Li