Hi, I'm seeing more and more questions around the Iceberg Parquet reader, so I think it would be useful to have a thread that clarifies the open questions and explains the long-term plan.
1. Am I correct that performance is the main reason to have a custom reader in Iceberg? Are there any other purposes? A common question I get is why not improve parquet-mr instead of writing a new reader. I know that almost every system that cares about performance has its own reader, but why is that?

2. Iceberg filters out row groups based on stats and dictionary pages on its own, whereas the Spark reader simply sets filters and relies on parquet-mr to do the filtering. My assumption is that there is a problem in parquet-mr. Is that correct? Is it somehow related to record materialization?

3. At some point, Julien Le Dem gave a talk about supporting page skipping in Parquet. His primary example was SELECT a, b FROM t WHERE c = 'smth', i.e. skipping pages in some columns based on predicates on other columns. It is a highly anticipated feature on our end. Can somebody clarify whether it will be part of parquet-mr or whether we will have to implement it in Iceberg?

4. What is the long-term vision for the Parquet reader in Iceberg? Are there any plans to contribute parts of it back to parquet-mr? Will the Iceberg reader be mostly independent of parquet-mr?

5. We are considering reading Parquet data into Arrow. Will it be something specific to Iceberg or generally available? I believe it is quite a common use case.

Thanks,
Anton