+1 for this proposal. Storing all structured, semi-structured, and unstructured data in one place has been one of the main design goals of the data lake since day one. Processing semi-structured and unstructured data is becoming even more crucial in the AI era. Lake formats like Iceberg and Paimon and processing engines like Spark and StarRocks already offer such support, so I think Flink should as well, in order to complete the semi-structured data processing picture in the lakehouse architecture.

The FLIP looks good to me in general.

Best,

Xintong

On Mon, Apr 14, 2025 at 6:13 PM Xuannan Su <suxuanna...@gmail.com> wrote:

> Hi devs,
>
> I’d like to start a discussion around FLIP-521: Integrating Variant
> Type into Flink: Enabling Efficient Semi-Structured Data
> Processing[1]. Working with semi-structured data has long been a
> foundational scenario of the Lakehouse. While JSON has traditionally
> served as the primary storage format for such data, its implementation
> as serialized strings introduces significant inefficiencies.
>
> In this FLIP, we integrate the Variant encoding, which is a compact
> binary representation of semi-structured data[2], to improve the
> performance of processing semi-structured data. As Paimon has
> supported the Variant type recently[3], this FLIP would allow Flink to
> further leverage Paimon's storage-layer optimizations, improving
> performance and resource utilization for semi-structured data
> pipelines.
>
> Best,
> Xuannan
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-521%3A+Integrating+Variant+Type+into+Flink%3A+Enabling+Efficient+Semi-Structured+Data+Processing
> [2] https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
> [3] https://github.com/apache/paimon/issues/4471