Re: [SparkSQL] Try2: Parquet predicate pushdown troubles

2015-01-28 Thread Cheng Lian
Hey Yana, An update about this Parquet filter push-down issue. It turned out to be a bit complicated, but (hopefully) all clear now. 1. Yesterday I found a bug in Parquet, which essentially disables row group filtering for almost all AND predicates. * JIRA ticket: PARQUET-173

Re: [SparkSQL] Try2: Parquet predicate pushdown troubles

2015-01-21 Thread Cheng Lian
Oh yes, thanks for adding that; using sc.hadoopConfiguration.set also works :-)

On Wed, Jan 21, 2015 at 7:11 AM, Yana Kadiyska wrote:
> Thanks for looking Cheng. Just to clarify in case other people need this
> sooner, setting sc.hadoopConfiguration.set("parquet.task.side.metadata",
> "false")d

Re: [SparkSQL] Try2: Parquet predicate pushdown troubles

2015-01-21 Thread Yana Kadiyska
Thanks for looking, Cheng. Just to clarify in case other people need this sooner, setting sc.hadoopConfiguration.set("parquet.task.side.metadata", "false") did work well in terms of dropping row groups/showing small input size. What was odd about that is that the overall time wasn't much better...but

Re: [SparkSQL] Try2: Parquet predicate pushdown troubles

2015-01-20 Thread Cheng Lian
Hey Yana, Sorry for the late reply, I missed this important thread somehow. And many thanks for reporting this. It turned out to be a bug: filter pushdown is only enabled when using client side metadata, which is not expected, because the task side metadata code path is more performant. And I guess

[SparkSQL] Try2: Parquet predicate pushdown troubles

2015-01-13 Thread Yana Kadiyska
Attempting to bump this up in case someone can help out after all. I spent a few good hours stepping through the code today, so I'll summarize my observations both in hope I get some help and to help others that might be looking into this: 1. I am setting spark.sql.parquet.filterPushdown=true
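
For anyone landing on this thread later, here is a minimal sketch of how the two settings mentioned above might be applied together. It assumes a Spark 1.2-era SQLContext; the app name and local master are placeholders for illustration and not from the thread. Whether the second setting is still needed depends on the Spark/Parquet versions in use, so treat this as a workaround sketch rather than a recommended configuration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object PushdownSettings {
  def main(args: Array[String]): Unit = {
    // App name and master are hypothetical placeholders, not from the thread.
    val conf = new SparkConf()
      .setAppName("parquet-pushdown-check")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Ask Spark SQL to push filters down to the Parquet reader.
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

    // Workaround reported in the thread: use client side metadata so that
    // row groups are actually dropped.
    sc.hadoopConfiguration.set("parquet.task.side.metadata", "false")

    sc.stop()
  }
}
```

Both calls need to happen before the Parquet data is read for them to affect the scan.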