Hi everyone,
I am looking to increase the value of the config
spark.sql.parquet.pushdown.inFilterThreshold to boost performance for some
queries I am tuning. While reading the implementation in the Spark repo at
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L798
I came across the following code snippet:
case sources.In(name, values) if pushDownInFilterThreshold > 0 && values.nonEmpty &&
    canMakeFilterOn(name, values.head) =>
  val fieldType = nameToParquetField(name).fieldType
  val fieldNames = nameToParquetField(name).fieldNames
  if (values.length <= pushDownInFilterThreshold) {
    values.distinct.flatMap { v =>
      makeEq.lift(fieldType).map(_(fieldNames, v))
    }.reduceLeftOption(FilterApi.or)
  } else if (canPartialPushDownConjuncts) {
    if (values.contains(null)) {
      Seq(
        makeEq.lift(fieldType).map(_(fieldNames, null)),
        makeInPredicate.lift(fieldType).map(_(fieldNames, values.filter(_ != null)))
      ).flatten.reduceLeftOption(FilterApi.or)
    } else {
      makeInPredicate.lift(fieldType).map(_(fieldNames, values))
    }
  } else {
    None
  }
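If I read the two branches correctly, the predicates they build correspond
roughly to the shapes below when written directly against Parquet's
FilterApi (a sketch only; the column name "id" and the values are made up,
and FilterApi.in needs parquet-mr 1.12+):

import java.util.{HashSet => JHashSet}
import org.apache.parquet.filter2.predicate.FilterApi

val id = FilterApi.intColumn("id")

// Below the threshold: equality predicates chained with OR,
// i.e. id = 1 OR id = 2 OR id = 3
val orChain = FilterApi.or(
  FilterApi.or(
    FilterApi.eq(id, Int.box(1)),
    FilterApi.eq(id, Int.box(2))),
  FilterApi.eq(id, Int.box(3)))

// Above the threshold: a single IN predicate over the whole value set
val inValues = new JHashSet[Integer]()
Seq(1, 2, 3).foreach(v => inValues.add(Int.box(v)))
val inPredicate = FilterApi.in(id, inValues)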
From this I understand that when the number of values is less than or equal
to spark.sql.parquet.pushdown.inFilterThreshold, ParquetFilters.scala pushes
down a chain of OR'ed equality predicates rather than a single IN predicate.
What are the advantages of doing so?
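For context, the tuning I have in mind is simply raising the threshold at
the session level, roughly like this (the threshold value, the path, and
the IN list are placeholders; the default threshold is 10):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().getOrCreate()

// Raise the threshold so larger IN lists are still rewritten as OR chains
spark.conf.set("spark.sql.parquet.pushdown.inFilterThreshold", "100")

// An IN list longer than the default threshold of 10
val df = spark.read.parquet("/path/to/table")
df.filter(col("id").isin((1 to 50): _*)).count()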
Best Regards,
Yian