Hi Xinli,
Thanks for the KIP. I see that the discussion thread has died down, which is 
often a tricky situation with a KIP.

I’ve been thinking about this KIP for a while and it was really good to be able 
to attend the Kafka Summit London
session to get a proper understanding of it. I think it’s a really interesting 
idea, but….

I’m afraid I don’t think this is a good approach to solving the problem. If I understand correctly, the fundamental aim is to insert Parquet data from Kafka clients efficiently into a data lake. Part of the attraction is that Parquet compression is extremely efficient for the truly massive batches (up to 100,000 records) that you described at Kafka Summit. While this is an entirely sensible aim, I think there are better ways to achieve the same thing with Kafka.

The KIP seems asymmetrical to me. The “rows” of your Parquet data are individual Kafka records, which are then batched up and encoded into Parquet with an appropriate schema. But on the consumption side, you don’t really want to treat them as individual records at all. Instead, the entire batch is intended to be received and copied into the data lake as a unit. You are also relying on KIP-712, which is not an approved KIP.

I started wondering whether simply introducing Parquet as a new compression format for Kafka would be a neat and generally useful way to proceed. Would it really be generally applicable in the way that, say, zstd is? Would it work with all the parts of Kafka, such as the log cleaner? I think the answer is that it would not. It’s an effective encoding only for massive batches, and the requirement that the compressor knows the schema means that components in the “middle” that might need to apply compression would also need that information.

I think the best approach would be for your Kafka records to be sent to Kafka already in Parquet format. So, you would accumulate the rows of data into large batches, encode and compress them into Parquet using the correct schema, and then send the result to Kafka as a batch containing one large record. Then the existing producer and consumer could be used without change. I know this means you end up with very large records, but you end up in a very similar situation anyway by creating huge batches of 100,000 smaller records, which still need to be accommodated by Kafka.

Just my 2 cents. I hope I’ve managed to revitalise the discussion thread and 
get some more opinions too.

Thanks,
Andrew

> On 21 Nov 2023, at 17:20, Xinli shang <sha...@uber.com.INVALID> wrote:
>
> Hi, all
>
> Can I ask for a discussion on the KIP just created KIP-1008: ParKa - the
> Marriage of Parquet and Kafka
> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-1008%3A+ParKa+-+the+Marriage+of+Parquet+and+Kafka>
> ?
>
> --
> Xinli Shang
