Hi Xuannan,

sorry for the delay. This is a great addition but needs careful design. Here is some feedback I noted while reading the design doc:

1. Exception design

The FLIP mentions that an IllegalStateException is thrown. However, this is a very generic exception. We should rather use TableRuntimeException to indicate that this is a user mistake, or maybe even a dedicated new runtime exception type for variants.
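
To illustrate, roughly what I have in mind. The accessor below is a made-up excerpt, and I'm assuming TableRuntimeException lives in org.apache.flink.table.api and has a plain message constructor:

import org.apache.flink.table.api.TableRuntimeException;

// Hypothetical excerpt of the proposed Variant class, for illustration only.
final class VariantExceptionSketch {

    enum Type { LONG, STRING }

    private final Type type;
    private final long longValue;

    VariantExceptionSketch(Type type, long longValue) {
        this.type = type;
        this.longValue = longValue;
    }

    long getLong() {
        if (type != Type.LONG) {
            // Requesting the wrong type is a user mistake, not an internal
            // invariant violation, so TableRuntimeException (assumed message
            // constructor) fits better than IllegalStateException.
            throw new TableRuntimeException(
                    "Cannot read LONG from a variant of type " + type);
        }
        return longValue;
    }
}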

2. Variant.isValue()

The JavaDoc already indicates that `isScalar()` might be a more appropriate name.

3. Variant.getBinary()

Shall we call this `getBytes()`? That fits the data type BYTES better. We also don't call `getString()` `getVarchar()`.

4. VariantBuilder<T>

Could you explain why VariantBuilder uses a generic parameter? Maybe give an example of its usage in a UDF?
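
To make the question concrete, this is the kind of example I'd like to see in the FLIP. Every Variant/VariantBuilder call below is a guess on my side, not the actual proposed API:

import org.apache.flink.table.functions.ScalarFunction;

// Sketch only: the Variant/VariantBuilder methods shown here are hypothetical.
public class BuildUserEvent extends ScalarFunction {

    public Variant eval(String user, long count) {
        // Is <T> on VariantBuilder the element type being appended, the result
        // type, or something purely internal? A worked UDF example like this in
        // the FLIP would answer that.
        VariantBuilder<?> builder = Variant.newBuilder(); // hypothetical factory
        return builder
                .startObject()
                .field("user", user)
                .field("count", count)
                .endObject()
                .build();
    }
}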

5. Variant.get() vs. Variant.getField()

We should be more explicit in the naming. How about we use `Variant.getField(String)` and `Variant.getElement(int)` instead? This would also allow us to introduce an `Object Variant.get()` that returns a value based purely on `getType()`, and a `<T> T getAs()` consistent with `getFieldAs` in the Row class.
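
In other words, something along these lines. This is a pure naming sketch mirroring Row.getField / Row.getFieldAs; the Type enum values are placeholders, not the FLIP's list:

// Naming sketch only; not a complete interface proposal.
public interface Variant {

    // Field of a variant object by key (instead of the current get(String)).
    Variant getField(String fieldName);

    // Element of a variant array by position (instead of the current get(int)).
    Variant getElement(int index);

    // Scalar value boxed purely according to getType().
    Object get();

    // Scalar value as the requested type, analogous to Row.getFieldAs.
    <T> T getAs();

    Type getType();

    enum Type { OBJECT, ARRAY, STRING, LONG }
}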

6. Type.TIMESTAMP_NTZ

Do you mean TIMESTAMP_LTZ here?

7. Implementation of Variant following the same layout as the Parquet Variant Encoding

Parquet Variant Encoding supports int8, int16, and int32. Shouldn't we then also support those types in the `Variant` class, i.e. Short, Byte, etc.? In particular, I noticed that Integer is not supported. We should aim to stay close to the existing type system.

Also, will we add Parquet dependencies or model the same binary layout within the Flink code base?
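
To illustrate what I mean by staying close to the existing type system; the enum below is a placeholder and deliberately omits the physical type IDs of the encoding:

// Illustration only; the mapping reflects what I would expect as a user.
final class VariantTypeMapping {

    enum IntWidth { INT8, INT16, INT32, INT64 }

    // The Java accessor type I would expect per encoded integer width.
    static Class<?> javaTypeFor(IntWidth width) {
        switch (width) {
            case INT8:  return Byte.class;    // e.g. Variant.getByte()
            case INT16: return Short.class;   // e.g. Variant.getShort()
            case INT32: return Integer.class; // e.g. Variant.getInt()
            case INT64: return Long.class;    // e.g. Variant.getLong()
            default:    throw new IllegalArgumentException("Unknown width: " + width);
        }
    }
}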

8. Packages / modules

Please specify where you want to add the classes. Which package/module?

9. Built-in functions

We recently introduced a `JSON()` function but reserved its use to `JSON_OBJECT` and `JSON_ARRAY` only. With variant type support, we could allow this function at any location and have it return a VARIANT type. I guess it maps to `TRY_PARSE_JSON`?

A TO_JSON function is not required as `JSON_STRING` already supports arbitrary types and just needs to be extended.
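
To spell this out, the behaviour I'd imagine would look roughly like the following. Purely illustrative, nothing of this runs today; the exact semantics are what the FLIP should pin down:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class VariantJsonFunctionsSketch {
    public static void main(String[] args) {
        TableEnvironment env =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Today, JSON('...') is only valid inside JSON_OBJECT/JSON_ARRAY. With
        // variant support it could be allowed anywhere and return VARIANT:
        env.executeSql("SELECT JSON('{\"a\": 1, \"b\": [true, null]}')").print();

        // The reverse direction: JSON_STRING already accepts arbitrary types and
        // would only need to be extended to accept VARIANT as well:
        env.executeSql("SELECT JSON_STRING(JSON('{\"a\": 1}'))").print();
    }
}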

10. Extract values from a variant

The SQL standard defined SQL/JSON item methods (T865–T878) [1]. It would make SQL more readable to support a method per data type.
E.g. `SELECT integer(variantCol), timestamp(variantCol)`

11. Bumping Calcite

The FLIP states that you want to bump the version of Apache Calcite to 1.39.0. Are you planning to just backport the required classes or really upgrade to this version? If it is a full upgrade, we should do it in stages, as history has shown that bumping Calcite is not a straightforward task. But in any case, this would be great to do, as we could benefit from other features such as lambda support.

12. String representation and casting rules

Could you elaborate in the FLIP on how `CAST(variant AS ROW<i INT>)` or `CAST(variant AS INT)` behaves? Could you also elaborate on what the `env.executeSql().print()` representation looks like? Will it be JSON? Also, `Row.toString` should support the Variant type.
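
For example, the FLIP should answer what the following queries return and how they are printed. The table 'events' with a VARIANT column 'v' is hypothetical, just for illustration:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class VariantCastSketch {
    public static void main(String[] args) {
        TableEnvironment env =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Does this fail, return NULL, or coerce when 'v' holds {"i": "42"}?
        env.executeSql("SELECT CAST(v AS ROW<i INT>) FROM events").print();

        // Same question for scalar casts when 'v' holds a string or a double.
        env.executeSql("SELECT CAST(v AS INT) FROM events").print();

        // And how does print() render the raw variant? As JSON text?
        env.executeSql("SELECT v FROM events").print();
    }
}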

Cheers,
Timo


[1] https://peter.eisentraut.org/blog/2023/04/04/sql-2023-is-finished-here-is-whats-new


On 25.04.25 13:46, Xuannan Su wrote:
Hi everyone,

Thank you for all the comments! If there are no further comments, I'd
like to close the discussion and start the voting next Monday.

Best,
Xuannan

On Fri, Apr 25, 2025 at 7:41 PM Lincoln Lee <lincoln.8...@gmail.com> wrote:

+1 for this FLIP. VARIANT type support will be a great addition to SQL.
Look forward to the detailed design of the subsequent shredding
optimizations.


Best,
Lincoln Lee


On Tue, Apr 22, 2025 at 16:51, Timo Walther <twal...@apache.org> wrote:

+1 for this feature. Having a VARIANT type makes a lot of sense and
together with an OBJECT type will make semi-structured data processing
in Flink easier.

Currently, I'm catching up with notifications after the Easter holidays,
but happy to give some feedback by tomorrow or Thursday as well.

Thanks,
Timo

On 22.04.25 10:40, Jingsong Li wrote:
Thanks Xuannan for driving this discussion.

At present, communities such as Apache Iceberg, Delta, Spark, Parquet,
etc. are all designing and developing around Variant, and our Flink
support for Variant is very valuable.

After a rough look at the design, I see no overall problems. It is
designed around Parquet's Variant standard, similar to the overall
design of Spark SQL.

+1 for this.

Best,
Jingsong

On Mon, Apr 14, 2025 at 6:12 PM Xuannan Su <suxuanna...@gmail.com>
wrote:

Hi devs,

I’d like to start a discussion around FLIP-521: Integrating Variant
Type into Flink: Enabling Efficient Semi-Structured Data
Processing[1]. Working with semi-structured data has long been a
foundational scenario of the Lakehouse. While JSON has traditionally
served as the primary storage format for such data, its implementation
as serialized strings introduces significant inefficiencies.

In this FLIP, we integrate the Variant encoding, which is a compact
binary representation of semi-structured data[2], to improve the
performance of processing semi-structured data. As Paimon has
supported the Variant type recently[3], this FLIP would allow Flink to
further leverage Paimon's storage-layer optimizations, improving
performance and resource utilization for semi-structured data
pipelines.

Best,
Xuannan

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-521%3A+Integrating+Variant+Type+into+Flink%3A+Enabling+Efficient+Semi-Structured+Data+Processing
[2]
https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
[3] https://github.com/apache/paimon/issues/4471




