> If we can produce the segment in Parquet, which is the native format of a data lake, the consumer application (e.g., a Spark ingestion job) can dump the segments into the data lake directly as raw byte buffers, instead of unwrapping each record individually and writing it to a Parquet file one by one, which repeats the expensive encoding and compression steps.
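
If I read the proposal correctly, the consumer-side path would look roughly like the sketch below. This is only a rough illustration to contrast the two ingestion paths; the paths and helper class are hypothetical, not anything from the KIP.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Rough sketch: dumping a Parquet-formatted segment into the lake as an
// opaque byte blob, versus re-encoding every record.
public class SegmentDump {

    // Proposed path: the segment is already a valid Parquet file, so the
    // ingestion job copies it as raw bytes and skips record-level decoding,
    // re-encoding, and re-compression.
    static void dumpSegment(Path segmentFile, Path lakeDir) throws IOException {
        Path target = lakeDir.resolve(segmentFile.getFileName().toString() + ".parquet");
        Files.copy(segmentFile, target, StandardCopyOption.REPLACE_EXISTING);
    }

    // Today's path (in comments only): iterate the consumer records,
    // deserialize each one, and hand it to a ParquetWriter, which encodes
    // and compresses every value again before the file lands in the lake.
}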
This sounds like an interesting idea. I have one concern, though: data lake table formats (like Delta Lake, Hudi, and Iceberg) maintain column-level statistics, which are important for query performance. How would column stats be handled in this proposal?

On Tue, Nov 21, 2023 at 9:21 AM Xinli shang <sha...@uber.com.invalid> wrote:

> Hi, all
>
> Can I ask for a discussion on the newly created KIP-1008: ParKa - the
> Marriage of Parquet and Kafka
> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-1008%3A+ParKa+-+the+Marriage+of+Parquet+and+Kafka>?
>
> --
> Xinli Shang
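
For context on the column stats question: Parquet itself already stores per-row-group, per-column min/max/null-count statistics in the file footer, which can be read with parquet-hadoop roughly as sketched below (the file path is hypothetical). What is unclear to me is whether and how those footer stats would be propagated into the table format's own metadata (the Delta log, Iceberg manifests, or Hudi's metadata table), which is where query planners usually read file-level column stats from.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

// Rough sketch: print the column statistics that already live in a Parquet
// footer. The file path below is made up for illustration.
public class FooterStats {
    public static void main(String[] args) throws Exception {
        Path file = new Path("/lake/table/dumped-segment.parquet");
        try (ParquetFileReader reader =
                 ParquetFileReader.open(HadoopInputFile.fromPath(file, new Configuration()))) {
            for (BlockMetaData rowGroup : reader.getFooter().getBlocks()) {
                for (ColumnChunkMetaData column : rowGroup.getColumns()) {
                    System.out.printf("%s: min=%s max=%s nulls=%d%n",
                        column.getPath(),
                        column.getStatistics().genericGetMin(),
                        column.getStatistics().genericGetMax(),
                        column.getStatistics().getNumNulls());
                }
            }
        }
    }
}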