Hi all, I have a question regarding predicate pushdown for Parquet.
My understanding is that predicate pushdown uses the metadata (min/max statistics) in Parquet's row groups and pages to skip entire chunks that can't match, without needing to decode and filter every value in the table.

I was testing a scenario with 100M rows in a Parquet file. Summing over a numeric column took about 2-3 seconds. The file also has a key column (e.g. customer ID) with approximately 100 distinct values. My assumption was that filtering on a single customer ID before summing would reduce the query time significantly (not exactly linearly, but substantially), since entire segments could be skipped based on their metadata. Instead, it took longer: somewhere in the vicinity of 4-5 seconds. That suggests to me that the engine is reading all 100M values of the key column, filtering them, and then reading the matching segments/values of the measure column, which would explain the increase in time.

In the logs, I could see it was successfully pushing down a Parquet predicate, so I'm not sure why the filtered query is slower. Could anyone shed some light on this or point out where I'm going wrong? (A sketch of my test is below.) Thanks!
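For reference, here is a minimal PySpark-flavoured sketch of the two queries; the engine, the path, and the column names (`/data/events.parquet`, `amount`, `customer_id`) are placeholders for illustration, not my real setup:

```python
# Minimal repro sketch; path and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("parquet-pushdown-test").getOrCreate()

df = spark.read.parquet("/data/events.parquet")  # ~100M rows

# Baseline: sum over the measure column -- roughly 2-3 s in my test.
df.agg(F.sum("amount")).show()

# Filtered sum on the ~100-distinct-value key column -- roughly 4-5 s,
# i.e. slower than the unfiltered scan, despite the pushed-down predicate.
df.filter(F.col("customer_id") == 42).agg(F.sum("amount")).show()

# The physical plan confirms the pushdown; Spark prints something like:
#   PushedFilters: [IsNotNull(customer_id), EqualTo(customer_id,42)]
df.filter(F.col("customer_id") == 42).explain()
```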