Hi everyone,
I am looking to increase the value of the config
spark.sql.parquet.pushdown.inFilterThreshold to boost performance for some
queries I am tuning. While reading the implementation in the Spark repo at
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L798
I came across the following code snippet:
case sources.In(name, values) if pushDownInFilterThreshold > 0 && values.nonEmpty &&
    canMakeFilterOn(name, values.head) =>
  val fieldType = nameToParquetField(name).fieldType
  val fieldNames = nameToParquetField(name).fieldNames
  if (values.length <= pushDownInFilterThreshold) {
    values.distinct.flatMap { v =>
      makeEq.lift(fieldType).map(_(fieldNames, v))
    }.reduceLeftOption(FilterApi.or)
  } else if (canPartialPushDownConjuncts) {
    if (values.contains(null)) {
      Seq(
        makeEq.lift(fieldType).map(_(fieldNames, null)),
        makeInPredicate.lift(fieldType).map(_(fieldNames, values.filter(_ != null)))
      ).flatten.reduceLeftOption(FilterApi.or)
    } else {
      makeInPredicate.lift(fieldType).map(_(fieldNames, values))
    }
  } else {
    None
  }
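If I read the two branches correctly, the predicates they build correspond
roughly to the shapes below when written directly against Parquet's
FilterApi (a sketch only; the column name "id" and the values are made up,
and FilterApi.in needs parquet-mr 1.12+):

import java.util.{HashSet => JHashSet}
import org.apache.parquet.filter2.predicate.FilterApi

val id = FilterApi.intColumn("id")

// Below the threshold: equality predicates chained with OR,
// i.e. id = 1 OR id = 2 OR id = 3
val orChain = FilterApi.or(
  FilterApi.or(
    FilterApi.eq(id, Int.box(1)),
    FilterApi.eq(id, Int.box(2))),
  FilterApi.eq(id, Int.box(3)))

// Above the threshold: a single IN predicate over the whole value set
val inValues = new JHashSet[Integer]()
Seq(1, 2, 3).foreach(v => inValues.add(Int.box(v)))
val inPredicate = FilterApi.in(id, inValues)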
From this I understand that when the number of values is less than or equal
to spark.sql.parquet.pushdown.inFilterThreshold, ParquetFilters.scala pushes
down a chain of OR'ed equality predicates rather than a single IN predicate.
What are the advantages of doing so?
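For context, the tuning I have in mind is simply raising the threshold at
the session level, roughly like this (the threshold value, the path, and
the IN list are placeholders; the default threshold is 10):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().getOrCreate()

// Raise the threshold so larger IN lists are still rewritten as OR chains
spark.conf.set("spark.sql.parquet.pushdown.inFilterThreshold", "100")

// An IN list longer than the default threshold of 10
val df = spark.read.parquet("/path/to/table")
df.filter(col("id").isin((1 to 50): _*)).count()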
Best Regards,
Yian