Hi Xuannan,

sorry for the delay. This is a great addition but needs careful design. Here is some feedback I noted while reading the design doc:

1. Exception design

The FLIP mentions that an IllegalStateException is thrown. However, this is a very generic exception. We should rather use TableRuntimeException to indicate that this is a user mistake, or maybe even a dedicated new runtime exception type for variants.
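
To illustrate, roughly what I have in mind. The accessor below is a made-up excerpt, and I'm assuming TableRuntimeException lives in org.apache.flink.table.api and has a plain message constructor:

import org.apache.flink.table.api.TableRuntimeException;

// Hypothetical excerpt of the proposed Variant class, for illustration only.
final class VariantExceptionSketch {

    enum Type { LONG, STRING }

    private final Type type;
    private final long longValue;

    VariantExceptionSketch(Type type, long longValue) {
        this.type = type;
        this.longValue = longValue;
    }

    long getLong() {
        if (type != Type.LONG) {
            // Requesting the wrong type is a user mistake, not an internal
            // invariant violation, so TableRuntimeException (assumed message
            // constructor) fits better than IllegalStateException.
            throw new TableRuntimeException(
                    "Cannot read LONG from a variant of type " + type);
        }
        return longValue;
    }
}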

2. Variant.isValue()

The JavaDoc already indicates that `isScalar()` might be a more appropriate name.

3. Variant.getBinary()

Shall we call this `getBytes()`? That fits the data type BYTES better. We also don't call `getString()` `getVarchar()`.

4. VariantBuilder<T>

Could you explain why VariantBuilder uses a generic parameter? Maybe give an example of its usage in a UDF?
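
To make the question concrete, this is the kind of example I'd like to see in the FLIP. Every Variant/VariantBuilder call below is a guess on my side, not the actual proposed API:

import org.apache.flink.table.functions.ScalarFunction;

// Sketch only: the Variant/VariantBuilder methods shown here are hypothetical.
public class BuildUserEvent extends ScalarFunction {

    public Variant eval(String user, long count) {
        // Is <T> on VariantBuilder the element type being appended, the result
        // type, or something purely internal? A worked UDF example like this in
        // the FLIP would answer that.
        VariantBuilder<?> builder = Variant.newBuilder(); // hypothetical factory
        return builder
                .startObject()
                .field("user", user)
                .field("count", count)
                .endObject()
                .build();
    }
}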

5. Variant.get() vs. Variant.getField()

We should be more explicit in the naming. How about we use `Variant.getField(String)` and `Variant.getElement(int)` instead? This would also allow us to introduce an `Object Variant.get()` that returns a value based purely on `getType()`, and a `<T> T getAs()` consistent with `getFieldAs` in the Row class.
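
In other words, something along these lines. This is a pure naming sketch mirroring Row.getField / Row.getFieldAs; the Type enum values are placeholders, not the FLIP's list:

// Naming sketch only; not a complete interface proposal.
public interface Variant {

    // Field of a variant object by key (instead of the current get(String)).
    Variant getField(String fieldName);

    // Element of a variant array by position (instead of the current get(int)).
    Variant getElement(int index);

    // Scalar value boxed purely according to getType().
    Object get();

    // Scalar value as the requested type, analogous to Row.getFieldAs.
    <T> T getAs();

    Type getType();

    enum Type { OBJECT, ARRAY, STRING, LONG }
}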

6. Type.TIMESTAMP_NTZ

Do you mean TIMESTAMP_LTZ here?

7. Implementation of Variant following the same layout as the Parquet Variant Encoding

Parquet Variant Encoding supports int8, int16, and int32. Shouldn't we then also support those types in the `Variant` class, i.e. Short, Byte, etc.? In particular, I noticed that Integer is not supported. We should aim to stay close to the existing type system.

Also, will we add Parquet dependencies or model the same binary layout within the Flink code base?
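
To illustrate what I mean by staying close to the existing type system; the enum below is a placeholder and deliberately omits the physical type IDs of the encoding:

// Illustration only; the mapping reflects what I would expect as a user.
final class VariantTypeMapping {

    enum IntWidth { INT8, INT16, INT32, INT64 }

    // The Java accessor type I would expect per encoded integer width.
    static Class<?> javaTypeFor(IntWidth width) {
        switch (width) {
            case INT8:  return Byte.class;    // e.g. Variant.getByte()
            case INT16: return Short.class;   // e.g. Variant.getShort()
            case INT32: return Integer.class; // e.g. Variant.getInt()
            case INT64: return Long.class;    // e.g. Variant.getLong()
            default:    throw new IllegalArgumentException("Unknown width: " + width);
        }
    }
}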

8. Packages / modules

Please specify where you want to add the classes. Which package/module?

9. Built-in functions

We recently introduced a `JSON()` function but reserved its use to `JSON_OBJECT` and `JSON_ARRAY` only. With variant type support, we could allow this function at any location and have it return a VARIANT type. I guess it maps to `TRY_PARSE_JSON`?

A TO_JSON function is not required as `JSON_STRING` already supports arbitrary types and just needs to be extended.
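
To spell this out, the behaviour I'd imagine would look roughly like the following. Purely illustrative, nothing of this runs today; the exact semantics are what the FLIP should pin down:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class VariantJsonFunctionsSketch {
    public static void main(String[] args) {
        TableEnvironment env =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Today, JSON('...') is only valid inside JSON_OBJECT/JSON_ARRAY. With
        // variant support it could be allowed anywhere and return VARIANT:
        env.executeSql("SELECT JSON('{\"a\": 1, \"b\": [true, null]}')").print();

        // The reverse direction: JSON_STRING already accepts arbitrary types and
        // would only need to be extended to accept VARIANT as well:
        env.executeSql("SELECT JSON_STRING(JSON('{\"a\": 1}'))").print();
    }
}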

10. Extract values from a variant

The SQL standard defined SQL/JSON item methods (T865–T878) [1]. It would make SQL more readable to support a method per data type.
E.g. `SELECT integer(variantCol), timestamp(variantCol)`

11. Bumping Calcite

The FLIP states that you want to bump the version of Apache Calcite to 1.39.0. Are you planning to just backport the required classes or really upgrade to this version? If it is a full upgrade, we should do it in stages, as history has shown that bumping Calcite is not a straightforward task. But in any case, this would be great to do, as we could benefit from other features such as lambda support.

12. String representation and casting rules

Could you elaborate in the FLIP on how `CAST(variant AS ROW<i INT>)` or `CAST(variant AS INT)` behaves? Could you also elaborate on what the `env.executeSql().print()` representation looks like? Will it be JSON? Also, `Row.toString` should support the Variant type.
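
For example, the FLIP should answer what the following queries return and how they are printed. The table 'events' with a VARIANT column 'v' is hypothetical, just for illustration:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class VariantCastSketch {
    public static void main(String[] args) {
        TableEnvironment env =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Does this fail, return NULL, or coerce when 'v' holds {"i": "42"}?
        env.executeSql("SELECT CAST(v AS ROW<i INT>) FROM events").print();

        // Same question for scalar casts when 'v' holds a string or a double.
        env.executeSql("SELECT CAST(v AS INT) FROM events").print();

        // And how does print() render the raw variant? As JSON text?
        env.executeSql("SELECT v FROM events").print();
    }
}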

Cheers,
Timo


[1] https://peter.eisentraut.org/blog/2023/04/04/sql-2023-is-finished-here-is-whats-new


On 25.04.25 13:46, Xuannan Su wrote:
Hi everyone,

Thank you for all the comments! If there are no further comments, I'd
like to close the discussion and start the voting next Monday.

Best,
Xuannan

On Fri, Apr 25, 2025 at 7:41 PM Lincoln Lee <lincoln.8...@gmail.com> wrote:

+1 for this FLIP. VARIANT type support will be a great addition to SQL.
Look forward to the detailed design of the subsequent shredding
optimizations.


Best,
Lincoln Lee


On Tue, Apr 22, 2025 at 16:51, Timo Walther <twal...@apache.org> wrote:

+1 for this feature. Having a VARIANT type makes a lot of sense and
together with an OBJECT type will make semi-structured data processing
in Flink easier.

Currently, I'm catching up with notifications after the Easter holidays,
but happy to give some feedback by tomorrow or Thursday as well.

Thanks,
Timo

On 22.04.25 10:40, Jingsong Li wrote:
Thanks Xuannan for driving this discussion.

At present, communities such as Apache Iceberg, Delta, Spark, Parquet,
etc. are all designing and developing around Variant, and our Flink
support for Variant is very valuable.

After a rough look at the design, I see no overall problems. It is
designed around Parquet's Variant standard, similar to the overall
design of Spark SQL.

+1 for this.

Best,
Jingsong

On Mon, Apr 14, 2025 at 6:12 PM Xuannan Su <suxuanna...@gmail.com>
wrote:

Hi devs,

I’d like to start a discussion around FLIP-521: Integrating Variant
Type into Flink: Enabling Efficient Semi-Structured Data
Processing[1]. Working with semi-structured data has long been a
foundational scenario of the Lakehouse. While JSON has traditionally
served as the primary storage format for such data, its implementation
as serialized strings introduces significant inefficiencies.

In this FLIP, we integrate the Variant encoding, which is a compact
binary representation of semi-structured data[2], to improve the
performance of processing semi-structured data. As Paimon has
supported the Variant type recently[3], this FLIP would allow Flink to
further leverage Paimon's storage-layer optimizations, improving
performance and resource utilization for semi-structured data
pipelines.

Best,
Xuannan

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-521%3A+Integrating+Variant+Type+into+Flink%3A+Enabling+Efficient+Semi-Structured+Data+Processing
[2]
https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
[3] https://github.com/apache/paimon/issues/4471




