Thanks for that. Strangely enough, I was actually using 1.1.1, where it did seem to be enabled by default. Since upgrading to 1.2.0 and setting that flag, I do get the expected result! Looks good!
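For reference, a minimal sketch of setting it per session (assuming the flag in question is spark.sql.parquet.filterPushdown from the configuration page Michael linked):

    // Minimal sketch (Spark 1.2.0, Scala). Assumes the flag being discussed
    // is spark.sql.parquet.filterPushdown, per the SQL programming guide's
    // configuration section linked in the quoted message below.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("parquet-pushdown"))
    val sqlContext = new SQLContext(sc)

    // Off by default in 1.2.0 because of the null-row-group bug (SPARK-4258),
    // so it has to be enabled explicitly.
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")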
On Tue, Jan 6, 2015 at 12:17 PM, Michael Armbrust <mich...@databricks.com> wrote:

> Predicate push down into the input format is turned off by default
> because there is a bug in the current parquet library that causes null
> pointer exceptions when there are row groups that are entirely null.
>
> https://issues.apache.org/jira/browse/SPARK-4258
>
> You can turn it on if you want:
> http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration
>
> On Mon, Jan 5, 2015 at 3:38 PM, Adam Gilmore <dragoncu...@gmail.com> wrote:
>
>> Hi all,
>>
>> I have a question regarding predicate pushdown for Parquet.
>>
>> My understanding was that this would use the metadata in Parquet's
>> blocks/pages to skip entire chunks that can't match, without needing to
>> decode and filter every value in the table.
>>
>> I was testing a scenario with 100M rows in a Parquet file.
>>
>> Summing over a column took about 2-3 seconds.
>>
>> I also have a column (e.g. customer ID) with approximately 100 unique
>> values. My assumption, though not exactly linear, was that filtering on
>> this would reduce the query time significantly because entire segments
>> would be skipped based on the metadata.
>>
>> In fact, it took longer - somewhere in the vicinity of 4-5 seconds -
>> which suggested to me that it's reading all the values for the key
>> column (100M values), then filtering, then reading all the relevant
>> segments/values for the "measure" column, hence the increase in time.
>>
>> In the logs, I could see it was successfully pushing down a Parquet
>> predicate, so I'm not sure why this is taking longer.
>>
>> Could anyone shed some light on this or point out where I'm going
>> wrong? Thanks!
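For anyone searching the archives, a rough sketch of the kind of test described above; the Parquet path and the column names (customerId, amount) are hypothetical stand-ins, not from the original messages:

    // Rough sketch of the scenario above (Spark 1.2.0, Scala). The path and
    // the column names customerId/amount are hypothetical placeholders.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("pushdown-test"))
    val sqlContext = new SQLContext(sc)
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

    val events = sqlContext.parquetFile("/data/events.parquet")
    events.registerTempTable("events")

    // Plain aggregation: scans the full measure column (the ~2-3 second
    // query in the scenario above).
    sqlContext.sql("SELECT SUM(amount) FROM events").collect()

    // Filtered aggregation: with pushdown enabled, the customerId predicate
    // is handed to the Parquet input format so row groups whose column
    // statistics cannot match are skipped without decoding their values.
    sqlContext.sql(
      "SELECT SUM(amount) FROM events WHERE customerId = 42").collect()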