Hey Yana,
An update on this Parquet filter push-down issue. It turned out to be
a bit complicated, but (hopefully) it's all clear now.
1. Yesterday I found a bug in Parquet which essentially disables row
   group filtering for almost all AND predicates.
   * JIRA ticket: PARQUET-173
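For context, a minimal sketch of the kind of conjunctive (AND) predicate affected, using Spark 1.2-era APIs; the path, table name, and column names are illustrative, not from this thread:

```scala
// Sketch only: "/data/events.parquet" and the "day" column are hypothetical.
val events = sqlContext.parquetFile("/data/events.parquet")
events.registerTempTable("events")

// A conjunctive predicate that Parquet should be able to use to drop
// whole row groups based on per-column min/max statistics:
sqlContext.sql(
  "SELECT COUNT(*) FROM events WHERE day >= 20150101 AND day <= 20150131")
```

With the bug described above, row group filtering is skipped for predicates like this unless the client-side metadata code path is used.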
Oh yes, thanks for adding that using sc.hadoopConfiguration.set also works
:-)
On Wed, Jan 21, 2015 at 7:11 AM, Yana Kadiyska wrote:
Thanks for looking, Cheng. Just to clarify in case other people need this
sooner: setting sc.hadoopConfiguration.set("parquet.task.side.metadata",
"false") did work well in terms of dropping row groups/showing small input
size. What was odd about that is that the overall time wasn't much
better...but
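The workaround described above can be sketched as follows, assuming Spark 1.2-era APIs; the app name and master are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(
  new SparkConf().setAppName("parquet-pushdown-test").setMaster("local[*]"))

// Workaround from this thread: force client-side metadata so that
// row group filtering actually runs (see PARQUET-173).
sc.hadoopConfiguration.set("parquet.task.side.metadata", "false")

val sqlContext = new SQLContext(sc)
// Filter push-down itself must also be enabled explicitly:
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
```

Note that, per the thread, this trades away the more performant task-side metadata code path, which may explain why the overall time doesn't improve much even though input size shrinks.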
Hey Yana,
Sorry for the late reply; I missed this important thread somehow. And many
thanks for reporting this. It turned out to be a bug: filter push-down
is only enabled when using client-side metadata, which is not expected,
because the task-side metadata code path is more performant. And I guess
Attempting to bump this up in case someone can help out after all. I spent
a few good hours stepping through the code today, so I'll summarize my
observations, both in the hope of getting some help and to help others who
might be looking into this:
1. I am setting spark.sql.parquet.filterPushdown=true