+1 for this proposal.

Storing all structured, semi-structured, and unstructured data in one place
has been one of the main design goals of the data lake since day one.
Processing semi-structured and unstructured data is becoming even more
crucial in the AI era. Since this is already supported by lake formats like
Iceberg and Paimon and by processing engines like Spark and StarRocks, I
think Flink should support it as well, to complete the semi-structured
data processing picture in the lakehouse architecture.

The FLIP looks good to me in general.

Best,

Xintong



On Mon, Apr 14, 2025 at 6:13 PM Xuannan Su <suxuanna...@gmail.com> wrote:

> Hi devs,
>
> I’d like to start a discussion around FLIP-521: Integrating Variant
> Type into Flink: Enabling Efficient Semi-Structured Data
> Processing[1]. Working with semi-structured data has long been a
> foundational scenario of the Lakehouse. While JSON has traditionally
> served as the primary storage format for such data, its implementation
> as serialized strings introduces significant inefficiencies.
>
> In this FLIP, we integrate the Variant encoding, which is a compact
> binary representation of semi-structured data[2], to improve the
> performance of processing semi-structured data. As Paimon has
> supported the Variant type recently[3], this FLIP would allow Flink to
> further leverage Paimon's storage-layer optimizations, improving
> performance and resource utilization for semi-structured data
> pipelines.
>
> Best,
> Xuannan
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-521%3A+Integrating+Variant+Type+into+Flink%3A+Enabling+Efficient+Semi-Structured+Data+Processing
> [2]
> https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
> [3] https://github.com/apache/paimon/issues/4471
>
