Hi,

I see more and more questions around the Iceberg Parquet reader. I think it 
would be useful to have a thread that clarifies the open questions and explains 
the long-term plan.

1. Am I correct that performance is the main reason to have a custom reader in 
Iceberg? Are there any other purposes? A common question I get is why not 
improve parquet-mr instead of writing a new reader. I know that almost every 
system that cares about performance has its own reader, but why is that?

2. Iceberg filters out row groups based on stats and dictionary pages on its 
own, whereas the Spark reader simply sets filters and relies on parquet-mr to 
do the filtering. My assumption is that there is a problem in parquet-mr. Is 
that correct? Is it somehow related to record materialization?
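To make concrete what I mean by stats-based pruning, here is a rough sketch 
done directly against the footer metadata that parquet-mr exposes. It assumes 
an INT64 column named "c" and an equality predicate; the class name and file 
path are mine, not Iceberg's:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.column.statistics.LongStatistics;
    import org.apache.parquet.column.statistics.Statistics;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
    import org.apache.parquet.hadoop.util.HadoopInputFile;

    public class RowGroupPruning {
      // Returns the row groups that may contain rows where c == value,
      // based only on the min/max statistics in the file footer.
      public static List<BlockMetaData> matchingRowGroups(Path file, long value)
          throws IOException {
        List<BlockMetaData> matches = new ArrayList<>();
        try (ParquetFileReader reader = ParquetFileReader.open(
            HadoopInputFile.fromPath(file, new Configuration()))) {
          for (BlockMetaData rowGroup : reader.getFooter().getBlocks()) {
            if (mayContain(rowGroup, value)) {
              matches.add(rowGroup);
            }
          }
        }
        return matches;
      }

      private static boolean mayContain(BlockMetaData rowGroup, long value) {
        for (ColumnChunkMetaData column : rowGroup.getColumns()) {
          if (column.getPath().toDotString().equals("c")) {
            Statistics<?> stats = column.getStatistics();
            if (stats instanceof LongStatistics && stats.hasNonNullValue()) {
              LongStatistics longStats = (LongStatistics) stats;
              // Drop the row group only when stats prove no row can match.
              return longStats.getMin() <= value && value <= longStats.getMax();
            }
          }
        }
        return true; // no usable stats: keep the row group
      }
    }

My question is whether setting an equivalent filter on parquet-mr should 
already give the same pruning, and if not, where it falls short.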

3. At some point, Julien Le Dem gave a talk about supporting page skipping in 
Parquet. His primary example was SELECT a, b FROM t WHERE c = 'smth': 
basically, skipping pages in some columns based on predicates on other columns. 
It is a highly anticipated feature on our end. Can somebody clarify whether it 
will be part of parquet-mr or whether we will have to implement it in Iceberg?
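From the API side, I would expect the predicate to be expressed the way 
parquet-mr already accepts filters, roughly as below (the file path and the 
Group-based read support are just for illustration); the open question is 
whether the reader will then skip pages of a and b using page-level indexes:

    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.filter2.compat.FilterCompat;
    import org.apache.parquet.filter2.predicate.FilterApi;
    import org.apache.parquet.filter2.predicate.FilterPredicate;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.example.GroupReadSupport;
    import org.apache.parquet.io.api.Binary;

    public class PageSkippingExample {
      public static void main(String[] args) throws Exception {
        // Predicate on column c; the projection of a and b would be set
        // separately via the requested read schema.
        FilterPredicate pred = FilterApi.eq(
            FilterApi.binaryColumn("c"), Binary.fromString("smth"));

        try (ParquetReader<Group> reader =
            ParquetReader.builder(new GroupReadSupport(), new Path("/tmp/t.parquet"))
                .withFilter(FilterCompat.get(pred))
                .build()) {
          Group row;
          while ((row = reader.read()) != null) {
            // Only rows where c = 'smth' are returned; whether whole pages
            // of a and b get skipped is exactly what I am asking about.
          }
        }
      }
    }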

4. What is the long-term vision for the Parquet reader in Iceberg? Are there 
any plans to submit parts of it to parquet-mr? Will the Iceberg reader be 
mostly independent of parquet-mr?

5. We are considering reading Parquet data into Arrow. Will it be something 
specific to Iceberg or generally available? I believe it is quite a common use 
case.

Thanks,
Anton
