Hm, is it fair to say that making dictionary encoding work for predicate columns is a way to mitigate the absence of page skipping?

> On 6 Mar 2019, at 12:19, Anton Okolnychyi <aokolnyc...@apple.com.INVALID> wrote:
>
> Hi,
>
> I was going through the code in Iceberg ParquetReader. Could anybody confirm
> or correct my statements below?
>
> Right now, Iceberg can filter out row groups in Parquet. Iceberg fetches row
> group stats from the footer and applies ParquetMetricsRowGroupFilter on that
> information. In addition, the footer contains metadata per column chunk
> including its offset. ParquetDictionaryRowGroupFilter uses that column chunk
> metadata to read an optional dictionary page for each column chunk. If a
> dictionary page is present, it will always be at the beginning of each column
> chunk. ParquetDictionaryRowGroupFilter ensures that all pages within a column
> chunk are dictionary encoded when Iceberg filters out row groups based on
> dictionaries.
>
> Also, I have a question about skipping individual pages using page stats. To
> the best of my knowledge, this info was originally stored in page headers,
> which made page skipping not as efficient as it could be because it required
> reading all page headers spread out throughout the file. I remember some
> efforts in the Parquet community to add page level statistics to the footer.
>
> Now let's assume we have page level stats in the footer or have an efficient
> way to collect that info. Then we have a query that covers two columns. Using
> a predicate on the first column, we see that page 3 doesn't contain any
> relevant values, so we can skip the entire page for that column. However, we
> cannot just skip page 3 for the second column as the number of values within
> a page is not fixed and might vary between column chunks. Basically, there is
> no one-to-one mapping between pages.
>
> My question is if we can have a relatively efficient page skipping in Parquet
> at this point.
>
> Thanks,
> Anton
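
To make the page misalignment described in the quoted message concrete, here is a minimal sketch in Python. It is not Iceberg or Parquet code. It assumes we somehow know the number of rows in each page of a column chunk (the lists below are made-up numbers), and the helper names `page_row_ranges`, `merge_ranges` and `fully_covered_pages` are hypothetical. The idea is to translate "page 3 of the first column can be skipped" into a row range and then check which pages of the second column fall entirely inside that range.

```python
from itertools import accumulate


def page_row_ranges(rows_per_page):
    """Return a (first_row, end_row_exclusive) pair for each page of a column chunk."""
    ends = list(accumulate(rows_per_page))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


def merge_ranges(ranges):
    """Merge overlapping or adjacent row ranges into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def fully_covered_pages(pages, skipped_rows):
    """Indices of pages whose rows all lie inside a single skipped row range."""
    skipped = merge_ranges(skipped_rows)
    return [
        i
        for i, (start, end) in enumerate(pages)
        if any(s <= start and end <= e for s, e in skipped)
    ]


# Assumed layout: the first column has four pages of 100 rows each.
col_a = page_row_ranges([100, 100, 100, 100])
skipped = [col_a[2]]  # "page 3" (index 2) covers rows 200..299

# The second column has a different page layout in the same row group.
col_b_coarse = page_row_ranges([150, 150, 100])
col_b_fine = page_row_ranges([50] * 8)

print(fully_covered_pages(col_b_coarse, skipped))  # no page lies fully in rows 200..299
print(fully_covered_pages(col_b_fine, skipped))    # pages at index 4 and 5 do
```

With the coarse layout, rows 200..299 of the second column share a page with rows 150..199, so that page still has to be read; only when page boundaries happen to line up (as in the finer layout) can whole pages be dropped for the second column as well.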