Hi Xinli,

Thanks for the KIP. I see that the discussion thread has died down, which is often a tricky situation for a KIP.
I’ve been thinking about this KIP for a while, and it was really good to be able to attend the Kafka Summit London session to get a proper understanding of it. I think it’s a really interesting idea, but… I’m afraid I don’t think this is a good approach to solving the problem.

If I understand correctly, the fundamental aim is to get Parquet data from Kafka clients into a data lake efficiently. Part of the attraction is that Parquet compression is extremely efficient for the truly massive batches (up to 100,000 records) that you described at Kafka Summit. While this is an entirely sensible aim, I think there are better ways to do the same thing with Kafka.

The KIP seems asymmetrical to me. The “rows” of your Parquet data are individual Kafka records, which are then batched up and encoded into Parquet with an appropriate schema. But on the consumption side, you don’t really want to treat them as individual records at all; instead, the entire batch is intended to be received and copied into the data lake as a unit. You are also relying on KIP-712, which is not an approved KIP.

I started wondering whether simply introducing Parquet as a new compression format for Kafka would be a neat and generally useful way to proceed. Would it really be as generally applicable as, say, zstd? Would it work with all parts of Kafka, such as the log cleaner? I think the answer is that it would not. It’s an effective encoding only for massive batches, and the requirement that the compressor knows the schema means that components in the “middle” that might need to apply compression would also need to know this information.

I think the best approach would be for your records to be sent to Kafka already in Parquet format. So, you would accumulate the rows of data into large batches, encode and compress them into Parquet using the correct schema, and then send them to Kafka as a batch containing one large record (there’s a rough sketch of what I mean below the quoted message). Then the existing producer and consumer could be used without change. I know this means you end up with very large records, but creating huge batches of 100,000 smaller records puts you in a very similar situation, since those batches still need to be accommodated by Kafka.

Just my 2 cents. I hope I’ve managed to revitalise the discussion thread and get some more opinions too.

Thanks,
Andrew

> On 21 Nov 2023, at 17:20, Xinli shang <sha...@uber.com.INVALID> wrote:
>
> Hi, all
>
> Can I ask for a discussion on the KIP just created KIP-1008: ParKa - the
> Marriage of Parquet and Kafka
> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-1008%3A+ParKa+-+the+Marriage+of+Parquet+and+Kafka>
> ?
>
> --
> Xinli Shang
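
P.S. Here is the rough sketch I mentioned of the “one large Parquet record per batch” approach, in case it helps the discussion. It is only illustrative: it assumes parquet-avro and the plain Java producer are on the classpath, the schema, topic name and size limits are made up, and a real implementation would presumably build the Parquet bytes in memory rather than via a temp file.

// Rough sketch only. Assumes parquet-avro and the plain Java producer;
// the schema, topic name and sizes are made up for illustration.
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

import java.io.File;
import java.nio.file.Files;
import java.util.List;
import java.util.Properties;

public class ParquetBatchProducer {

    // Hypothetical schema; in practice this comes from wherever your table schemas live.
    private static final Schema SCHEMA = SchemaBuilder.record("Event")
            .fields()
            .requiredString("id")
            .requiredLong("timestamp")
            .endRecord();

    public static void main(String[] args) throws Exception {
        // 1. Accumulate a large batch of rows on the client (placeholder here).
        List<GenericRecord> rows = loadPendingRows();

        // 2. Encode the whole batch into a single Parquet object, applying the schema
        //    once per batch. A temp file keeps the sketch simple; an in-memory
        //    OutputFile implementation would avoid the disk round trip.
        File tmp = File.createTempFile("batch-", ".parquet");
        tmp.delete(); // ParquetWriter refuses to overwrite an existing file by default
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path(tmp.getAbsolutePath()))
                .withSchema(SCHEMA)
                .withCompressionCodec(CompressionCodecName.ZSTD)
                .build()) {
            for (GenericRecord row : rows) {
                writer.write(row);
            }
        }
        byte[] parquetBytes = Files.readAllBytes(tmp.toPath());

        // 3. Send the encoded batch to Kafka as ONE large, opaque record.
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 64 * 1024 * 1024); // illustrative
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "none"); // Parquet is already compressed

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events-parquet", parquetBytes)).get();
        }
    }

    private static List<GenericRecord> loadPendingRows() {
        // Placeholder for whatever accumulates the ~100,000 rows per batch.
        GenericRecord r = new GenericData.Record(SCHEMA);
        r.put("id", "example");
        r.put("timestamp", System.currentTimeMillis());
        return List.of(r);
    }
}

The point of the sketch is that the schema is applied once per batch on the client, and Kafka itself only ever sees an opaque, already-compressed payload, so nothing in the brokers, the log cleaner or the consumers needs to understand Parquet. The only Kafka-side adjustment is raising the usual size limits (max.request.size on the producer, message.max.bytes on the broker or max.message.bytes on the topic).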