Hm, is it fair to say that making dictionary encoding work for predicate 
columns is a way to mitigate the absence of page skipping?
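
To make the dictionary check concrete, here is a minimal sketch of the idea as I understand it from the quoted description below. `ColumnChunk` and `may_contain` are hypothetical names for illustration only, not the actual Iceberg or Parquet API, and only an equality predicate is shown.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class ColumnChunk:
    # Hypothetical stand-in for column chunk metadata; not a real Parquet or Iceberg class.
    dictionary: Optional[Set[str]]  # values from the dictionary page, or None if absent
    all_pages_dict_encoded: bool    # False if any data page fell back to another encoding


def may_contain(chunk: ColumnChunk, value: str) -> bool:
    """Return False only when the row group provably has no row equal to value."""
    if chunk.dictionary is None or not chunk.all_pages_dict_encoded:
        # The dictionary is missing or does not cover every page, so it proves nothing.
        return True
    return value in chunk.dictionary
```

A row group is dropped only when `may_contain` returns False for the predicate column; in every other case it still has to be read.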

> On 6 Mar 2019, at 12:19, Anton Okolnychyi <aokolnyc...@apple.com.INVALID> 
> wrote:
> 
> Hi,
> 
> I was going through the code in Iceberg ParquetReader. Could anybody confirm 
> or correct my statements below?
> 
> Right now, Iceberg can filter out row groups in Parquet. Iceberg fetches row 
> group stats from the footer and applies ParquetMetricsRowGroupFilter on that 
> information. In addition, the footer contains metadata per column chunk 
> including its offset. ParquetDictionaryRowGroupFilter uses that column chunk 
> metadata to read an optional dictionary page for each column chunk. If a 
> dictionary page is present, it will always be at the beginning of each column 
> chunk. ParquetDictionaryRowGroupFilter ensures that all pages within a column 
> chunk are dictionary encoded when Iceberg filters out row groups based on 
> dictionaries.
> 
> Also, I have a question about skipping individual pages using page stats. To 
> the best of my knowledge, this info was originally stored in page headers, 
> which made page skipping not as efficient as it could be because it required 
> reading all page headers spread out throughout the file. I remember some 
> efforts in the Parquet community to add page level statistics to the footer.
> 
> Now let's assume we have page level stats in the footer or have an efficient 
> way to collect that info. Then we have a query that covers two columns. Using 
> a predicate on the first column, we see that page 3 doesn't contain any 
> relevant values, so we can skip the entire page for that column. However, we 
> cannot just skip page 3 for the second column as the number of values within 
> a page is not fixed and might vary between column chunks. Basically, there is 
> no one-to-one mapping between pages.
> 
> My question is whether relatively efficient page skipping is possible in 
> Parquet at this point.
> 
> Thanks,
> Anton
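
The missing one-to-one mapping between pages can be bridged by going through row ranges instead of page numbers. Below is a rough sketch, assuming the first row index of every page is available for each column chunk (the offset index in the Parquet format records this per page). The function names and the sample page boundaries are made up for illustration.

```python
from typing import List, Tuple

RowRange = Tuple[int, int]  # half-open interval [start, end) of row indexes


def page_row_ranges(first_rows: List[int], num_rows: int) -> List[RowRange]:
    """Turn per-page first row indexes into half-open row ranges."""
    ends = first_rows[1:] + [num_rows]
    return list(zip(first_rows, ends))


def pages_to_read(first_rows: List[int], num_rows: int,
                  wanted: List[RowRange]) -> List[int]:
    """Indexes of pages whose row range overlaps any wanted row range."""
    result = []
    for i, (start, end) in enumerate(page_row_ranges(first_rows, num_rows)):
        if any(start < w_end and w_start < end for w_start, w_end in wanted):
            result.append(i)
    return result


# Hypothetical row group with 400 rows and different page boundaries per column.
num_rows = 400
col_a_first_rows = [0, 100, 200, 300]
col_b_first_rows = [0, 120, 210, 290]

# Suppose the predicate on column A rules out its third page (index 2, rows 200..299).
a_ranges = page_row_ranges(col_a_first_rows, num_rows)
wanted = [r for i, r in enumerate(a_ranges) if i != 2]

# Column B can skip only the pages that lie entirely inside the excluded rows.
col_b_pages = pages_to_read(col_b_first_rows, num_rows, wanted)
```

With these boundaries, column B can skip only its third page (rows 210..289). Pages that partially overlap the excluded rows still have to be decoded and filtered row by row.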
