My take: OR will result in inlining of the OR conditions, which means no Map lookup. So I suppose it would save on the memory associated with Map creation (and that too, I suppose, per partition) and on the lookup costs incurred when implemented using IN. Maybe there are other reasons which I do not know...
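To make that trade-off concrete, here is a small standalone Scala sketch (illustrative only, not Spark or Parquet code; all names are made up) contrasting the two evaluation strategies: an OR-chain compares the probe against each value in turn with no auxiliary structure, while an IN-style predicate pays up front to build a hash set and then answers each probe with a set lookup.

```scala
// Hypothetical sketch contrasting an OR-chain of equality checks with a
// Set-backed IN lookup. Not Spark internals; names are illustrative.
object InVsOrSketch {
  // OR-chain: one equality comparison per value, short-circuiting on the
  // first match; no auxiliary data structure is ever built.
  def orChain(values: Seq[Int])(x: Int): Boolean =
    values.exists(_ == x)

  // IN predicate: build a hash set once (construction cost plus memory),
  // then answer each probe with an expected O(1) membership lookup.
  def inPredicate(values: Seq[Int]): Int => Boolean = {
    val set = values.toSet
    x => set.contains(x)
  }
}
```

For a handful of values the OR-chain's few comparisons are cheap and allocation-free; the set only pays off once the value list grows, which is consistent with gating the choice on spark.sql.parquet.pushdown.inFilterThreshold.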
Regards
Asif

On Tue, Sep 30, 2025 at 1:37 PM Yian Liou <[email protected]> wrote:

> Hi everyone,
>
> I am looking to increase the value of the config
> spark.sql.parquet.pushdown.inFilterThreshold to boost performance for some
> queries I am looking at. While looking at the implementation in the Spark
> repo at
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L798
> with the following code snippet:
>
>     case sources.In(name, values) if pushDownInFilterThreshold > 0 && values.nonEmpty &&
>         canMakeFilterOn(name, values.head) =>
>       val fieldType = nameToParquetField(name).fieldType
>       val fieldNames = nameToParquetField(name).fieldNames
>       if (values.length <= pushDownInFilterThreshold) {
>         values.distinct.flatMap { v =>
>           makeEq.lift(fieldType).map(_(fieldNames, v))
>         }.reduceLeftOption(FilterApi.or)
>       } else if (canPartialPushDownConjuncts) {
>         if (values.contains(null)) {
>           Seq(makeEq.lift(fieldType).map(_(fieldNames, null)),
>             makeInPredicate.lift(fieldType).map(_(fieldNames,
>               values.filter(_ != null)))
>           ).flatten.reduceLeftOption(FilterApi.or)
>         } else {
>           makeInPredicate.lift(fieldType).map(_(fieldNames, values))
>         }
>       } else {
>         None
>       }
>
> I see that when the number of items is less than or equal to
> spark.sql.parquet.pushdown.inFilterThreshold in ParquetFilters.scala,
> Parquet pushes ORs rather than an IN predicate. What are the advantages of
> doing so?
>
> Best Regards,
> Yian
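The branching in the quoted snippet can be summarized in a standalone Scala simplification (a hypothetical Filter ADT over Int values only; the real code dispatches through makeEq/makeInPredicate per Parquet field type): small value lists become an OR-chain of equality filters, larger ones a single IN predicate, with a null value split out into its own Eq because the IN predicate cannot carry it.

```scala
// Hypothetical simplification of the pushdown dispatch above. The case
// classes stand in for Parquet filter predicates; they are not real APIs.
object PushdownSketch {
  sealed trait Filter
  case class Eq(name: String, value: Option[Int]) extends Filter
  case class In(name: String, values: Seq[Int]) extends Filter
  case class Or(left: Filter, right: Filter) extends Filter

  // Below the threshold: OR-chain of Eq filters over the distinct values.
  // Above it: one In predicate, with null (None) peeled off into an Eq.
  def pushDownIn(name: String, values: Seq[Option[Int]], threshold: Int): Option[Filter] =
    if (values.length <= threshold)
      values.distinct.map(v => Eq(name, v): Filter).reduceLeftOption(Or.apply)
    else if (values.contains(None))
      Seq(Eq(name, None): Filter, In(name, values.flatten)).reduceLeftOption(Or.apply)
    else
      Some(In(name, values.flatten))
}
```

The reduceLeftOption(Or.apply) mirrors the snippet's reduceLeftOption(FilterApi.or): an empty result folds to None, so nothing is pushed down when no filter can be built.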
