Hi Steven,

Thank you for your question!

First, the statistics such as min/max and null count live inside the Parquet file itself (in the page and column indexes, and in the column-chunk metadata in the footer), so you can think of them as being inside the Parquet segment. In our proposal they are generated at the Kafka producer when the Parquet format is applied; they are simply part of the Parquet format.

Second, when a table format (Delta, Iceberg, Hudi) is applied during ingestion, those file-level statistics are rolled up into the table format's metadata.
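To make that concrete, here is a rough, illustrative sketch (not part of the KIP) of reading those column statistics back from a Parquet footer with parquet-mr; the class name and the file-path argument are just placeholders:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.parquet.hadoop.ParquetFileReader;
  import org.apache.parquet.hadoop.metadata.BlockMetaData;
  import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
  import org.apache.parquet.hadoop.util.HadoopInputFile;

  public class ParquetStatsDump {
    public static void main(String[] args) throws Exception {
      // Open the footer of a Parquet file (e.g., a segment produced in Parquet format).
      try (ParquetFileReader reader = ParquetFileReader.open(
          HadoopInputFile.fromPath(new Path(args[0]), new Configuration()))) {
        // Each row group carries per-column-chunk statistics: min, max, null count.
        for (BlockMetaData block : reader.getFooter().getBlocks()) {
          for (ColumnChunkMetaData column : block.getColumns()) {
            System.out.printf("column=%s min=%s max=%s nulls=%d%n",
                column.getPath(),
                column.getStatistics().genericGetMin(),
                column.getStatistics().genericGetMax(),
                column.getStatistics().getNumNulls());
          }
        }
      }
    }
  }

An ingestion job committing such a file to Iceberg/Delta/Hudi would read the same footer and record those values in the table metadata, which is the roll-up mentioned above.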
The second part is out of scope for this KIP because it falls entirely within the realm of ingestion. This KIP hands the ingestion application a byte buffer whose contents are already in Parquet format. (A rough sketch of what that consumer-side dump could look like is appended below the quoted thread.)

Let me know if you have any questions.

Xinli

On Sun, Nov 26, 2023 at 9:42 AM Steven Wu <stevenz...@gmail.com> wrote:

>
> if we can produce the segment with Parquet, which is the native format
> in a data lake, the consumer application (e.g., Spark jobs for ingestion)
> can directly dump the segments as raw byte buffer into the data lake
> without unwrapping each record individually and then writing to the Parquet
> file one by one with expensive steps of encoding and compression again.
>
> This sounds like an interesting idea. I have one concern though. Data
> Lake/table formats (like Delta Lake, Hudi, Iceberg) have column-level
> statistics, which are important for query performance. How would column
> stats be handled in this proposal?
>
> On Tue, Nov 21, 2023 at 9:21 AM Xinli shang <sha...@uber.com.invalid>
> wrote:
>
> > Hi, all
> >
> > Can I ask for a discussion on the KIP I just created, KIP-1008: ParKa -
> > the Marriage of Parquet and Kafka
> > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-1008%3A+ParKa+-+the+Marriage+of+Parquet+and+Kafka>
> > ?
> >
> > --
> > Xinli Shang
>

--
Xinli Shang
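P.S. Here is the rough consumer-side sketch mentioned above. It only illustrates the point that the byte buffer handed to the ingestion application is already a complete Parquet file; the class/method names and the local-file target are placeholders, since a real ingestion job would write to object storage through its own I/O layer.

  import java.io.FileOutputStream;
  import java.nio.ByteBuffer;
  import java.nio.channels.FileChannel;

  public class SegmentDump {
    // The segment payload is already a complete Parquet file, so it can be
    // written out verbatim -- no per-record decode, re-encode, or re-compression.
    public static void dumpSegment(ByteBuffer parquetSegment, String targetPath) throws Exception {
      try (FileChannel out = new FileOutputStream(targetPath).getChannel()) {
        while (parquetSegment.hasRemaining()) {
          out.write(parquetSegment);
        }
      }
    }
  }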