Hi Steven,

Thank you for your question!

First, the statistics such as min/max and null count live inside the Parquet file itself (in the page and column indexes, and in the column-chunk metadata in the footer), so you can think of them as being inside the Parquet segment. In our proposal they are generated at the Kafka producer when the Parquet format is applied; they are simply part of the Parquet format.

Second, when a table format (Delta, Iceberg, Hudi) is applied during ingestion, those file-level statistics are rolled up into the table format's metadata.
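To make that concrete, here is a rough, illustrative sketch (not part of the KIP) of reading those column statistics back from a Parquet footer with parquet-mr; the class name and the file-path argument are just placeholders:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.parquet.hadoop.ParquetFileReader;
  import org.apache.parquet.hadoop.metadata.BlockMetaData;
  import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
  import org.apache.parquet.hadoop.util.HadoopInputFile;

  public class ParquetStatsDump {
    public static void main(String[] args) throws Exception {
      // Open the footer of a Parquet file (e.g., a segment produced in Parquet format).
      try (ParquetFileReader reader = ParquetFileReader.open(
          HadoopInputFile.fromPath(new Path(args[0]), new Configuration()))) {
        // Each row group carries per-column-chunk statistics: min, max, null count.
        for (BlockMetaData block : reader.getFooter().getBlocks()) {
          for (ColumnChunkMetaData column : block.getColumns()) {
            System.out.printf("column=%s min=%s max=%s nulls=%d%n",
                column.getPath(),
                column.getStatistics().genericGetMin(),
                column.getStatistics().genericGetMax(),
                column.getStatistics().getNumNulls());
          }
        }
      }
    }
  }

An ingestion job committing such a file to Iceberg/Delta/Hudi would read the same footer and record those values in the table metadata, which is the roll-up mentioned above.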
The second part is out of scope for this KIP because it falls entirely within the realm of ingestion. This KIP hands the ingestion application a byte buffer whose contents are already in Parquet format. (A rough sketch of what that consumer-side dump could look like is appended below the quoted thread.)

Let me know if you have any questions.

Xinli

On Sun, Nov 26, 2023 at 9:42 AM Steven Wu <stevenz...@gmail.com> wrote:

>
> if we can produce the segment with Parquet, which is the native format
> in a data lake, the consumer application (e.g., Spark jobs for ingestion)
> can directly dump the segments as raw byte buffer into the data lake
> without unwrapping each record individually and then writing to the Parquet
> file one by one with expensive steps of encoding and compression again.
>
> This sounds like an interesting idea. I have one concern though. Data
> Lake/table formats (like Delta Lake, Hudi, Iceberg) have column-level
> statistics, which are important for query performance. How would column
> stats be handled in this proposal?
>
> On Tue, Nov 21, 2023 at 9:21 AM Xinli shang <sha...@uber.com.invalid>
> wrote:
>
> > Hi, all
> >
> > Can I ask for a discussion on the KIP I just created, KIP-1008: ParKa -
> > the Marriage of Parquet and Kafka
> > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-1008%3A+ParKa+-+the+Marriage+of+Parquet+and+Kafka>
> > ?
> >
> > --
> > Xinli Shang
>

--
Xinli Shang
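P.S. Here is the rough consumer-side sketch mentioned above. It only illustrates the point that the byte buffer handed to the ingestion application is already a complete Parquet file; the class/method names and the local-file target are placeholders, since a real ingestion job would write to object storage through its own I/O layer.

  import java.io.FileOutputStream;
  import java.nio.ByteBuffer;
  import java.nio.channels.FileChannel;

  public class SegmentDump {
    // The segment payload is already a complete Parquet file, so it can be
    // written out verbatim -- no per-record decode, re-encode, or re-compression.
    public static void dumpSegment(ByteBuffer parquetSegment, String targetPath) throws Exception {
      try (FileChannel out = new FileOutputStream(targetPath).getChannel()) {
        while (parquetSegment.hasRemaining()) {
          out.write(parquetSegment);
        }
      }
    }
  }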