Thanks for that.  Strangely enough, I was actually using 1.1.1, where it did
seem to be enabled by default.  Since upgrading to 1.2.0 and setting that
flag, I do get the expected result.  Looks good!
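
For reference, this is roughly what "setting that flag" looked like on my
end, using the spark.sql.parquet.filterPushdown setting from the
configuration page linked below (sketch only, names abbreviated):

    // Sketch: enable Parquet predicate pushdown in Spark SQL 1.2.0
    // (spark.sql.parquet.filterPushdown is off by default, per SPARK-4258)
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext `sc`
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

    // Or pass it at submit time:
    //   spark-submit --conf spark.sql.parquet.filterPushdown=true ...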

On Tue, Jan 6, 2015 at 12:17 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> Predicate push down into the input format is turned off by default
> because there is a bug in the current Parquet library that throws null
> pointer exceptions when a row group's values are entirely null.
>
> https://issues.apache.org/jira/browse/SPARK-4258
>
> You can turn it on if you want:
> http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration
>
> On Mon, Jan 5, 2015 at 3:38 PM, Adam Gilmore <dragoncu...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I have a question regarding predicate pushdown for Parquet.
>>
>> My understanding was that this would use the metadata in Parquet's
>> blocks/pages to skip entire chunks that cannot match the predicate,
>> without needing to decode and filter every value in the table.
>>
>> I was testing a scenario where I had 100M rows in a Parquet file.
>>
>> Summing over a column took about 2-3 seconds.
>>
>> I also have a column (e.g. customer ID) with approximately 100 unique
>> values.  My assumption was that filtering on this column would reduce the
>> query time significantly (though not exactly linearly), since entire
>> segments could be skipped based on the metadata.
>>
>> In fact, it took much longer - somewhere in the vicinity of 4-5 seconds -
>> which suggests it is reading all 100M values of the key column, then
>> filtering, then reading the relevant segments/values of the "measure"
>> column, hence the increase in time.
>>
>> In the logs, I could see it was successfully pushing down a Parquet
>> predicate, so I'm not sure I understand why this is taking longer.
>>
>> Could anyone shed some light on this or point out where I'm going wrong?
>> Thanks!
>>
>
>
