>  if we can produce the segment with Parquet, which is the native format
> in a data lake, the consumer application (e.g., Spark jobs for ingestion)
> can directly dump the segments as raw byte buffers into the data lake,
> without unwrapping each record individually and then writing records to a
> Parquet file one by one, repeating the expensive encoding and compression
> steps.
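(For illustration, a minimal sketch of the "raw dump" path described above.
It assumes a hypothetical consumer-side fetch that hands back the segment as
raw bytes already laid out as a valid Parquet file; no such API exists in
Kafka today, so treat this as a sketch of the idea, not the implementation.)

// Sketch only: assumes some hypothetical fetch path (not a real Kafka API)
// has already returned the segment as a ByteBuffer containing a complete,
// valid Parquet file produced on the producer/broker side.
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ParquetSegmentSink {

    // Copies the Parquet-formatted segment into the data lake as-is:
    // no per-record deserialization, re-encoding, or re-compression.
    public static void dumpSegment(ByteBuffer rawParquetSegment, Path lakeFile)
            throws Exception {
        try (FileChannel out = FileChannel.open(
                lakeFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            while (rawParquetSegment.hasRemaining()) {
                out.write(rawParquetSegment);
            }
        }
    }
}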

This sounds like an interesting idea. I have one concern, though. Data lake
table formats (like Delta Lake, Hudi, and Iceberg) maintain column-level
statistics, which are important for query performance. How would column
stats be handled in this proposal?
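(For concreteness, here is a small parquet-hadoop sketch of the kind of
per-file column stats I mean. Parquet files already carry min/max/null-count
statistics in their footers, so one option might be to read them back from
the dumped segment and register them with the table format; that is my
assumption about a possible answer, not something the KIP states.)

// Illustration only: prints the footer column statistics of one Parquet
// segment file, i.e. the values a table format's manifest or transaction
// log would need per data file.
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class SegmentColumnStats {

    public static void printStats(String parquetSegmentPath) throws Exception {
        HadoopInputFile input = HadoopInputFile.fromPath(
                new org.apache.hadoop.fs.Path(parquetSegmentPath),
                new Configuration());
        try (ParquetFileReader reader = ParquetFileReader.open(input)) {
            for (BlockMetaData rowGroup : reader.getFooter().getBlocks()) {
                for (ColumnChunkMetaData column : rowGroup.getColumns()) {
                    // min/max/null counts come straight from the Parquet footer.
                    System.out.printf("%s min=%s max=%s nulls=%d%n",
                            column.getPath().toDotString(),
                            column.getStatistics().genericGetMin(),
                            column.getStatistics().genericGetMax(),
                            column.getStatistics().getNumNulls());
                }
            }
        }
    }
}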

On Tue, Nov 21, 2023 at 9:21 AM Xinli shang <sha...@uber.com.invalid> wrote:

> Hi, all
>
> Can I ask for a discussion on the KIP just created KIP-1008: ParKa - the
> Marriage of Parquet and Kafka
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1008%3A+ParKa+-+the+Marriage+of+Parquet+and+Kafka
> >
> ?
>
> --
> Xinli Shang
>
