My take:
Using OR results in inlining of the OR conditions, which means no Map lookup.
So I suppose it would save on the memory associated with Map creation (and
that too, I suppose, per partition) and on the lookup costs that are incurred
when the filter is implemented using IN.
Maybe there are other reasons which I do not know...
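
To illustrate what I mean, here is a rough sketch in plain Scala. It is only
a simplified model of the two predicate shapes, not the actual Parquet or
Spark code: the object InVersusOr, the alias Pred and both helper functions
are hypothetical names made up for this example, and the values are
restricted to Int to keep it short.

```scala
// Simplified model only: hypothetical stand-ins, not Parquet's or Spark's classes.
object InVersusOr {
  type Pred = Int => Boolean

  // OR form: one equality check per distinct value, chained with ||.
  // No lookup structure is allocated; the comparisons are evaluated in order.
  def orPredicate(values: Seq[Int]): Pred = {
    val checks: Seq[Pred] = values.distinct.map(v => (x: Int) => x == v)
    checks
      .reduceLeftOption((l: Pred, r: Pred) => (x: Int) => l(x) || r(x))
      .getOrElse((_: Int) => false) // empty list matches nothing
  }

  // IN form: a set is built once up front and then probed for every row.
  def inPredicate(values: Seq[Int]): Pred = {
    val lookup: Set[Int] = values.toSet
    (x: Int) => lookup.contains(x)
  }
}
```

Both forms accept exactly the same rows; the difference is only in whether a
lookup structure has to be built and kept around. For a short list the OR
chain avoids that allocation, while for a long list the set lookup should
win, which I assume is why a threshold exists at all.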

Regards
Asif

On Tue, Sep 30, 2025 at 1:37 PM Yian Liou <[email protected]>
wrote:

> Hi everyone,
>
> I am looking to increase the value of the config
> spark.sql.parquet.pushdown.inFilterThreshold to boost performance for some
> queries I am looking at. While looking at the implementation in the Spark
> Repo at
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L798
>  with
> the following code snippet
>
>       case sources.In(name, values) if pushDownInFilterThreshold > 0 && values.nonEmpty &&
>           canMakeFilterOn(name, values.head) =>
>         val fieldType = nameToParquetField(name).fieldType
>         val fieldNames = nameToParquetField(name).fieldNames
>         if (values.length <= pushDownInFilterThreshold) {
>           values.distinct.flatMap { v =>
>             makeEq.lift(fieldType).map(_(fieldNames, v))
>           }.reduceLeftOption(FilterApi.or)
>         } else if (canPartialPushDownConjuncts) {
>           if (values.contains(null)) {
>             Seq(makeEq.lift(fieldType).map(_(fieldNames, null)),
>               makeInPredicate.lift(fieldType).map(_(fieldNames, values.filter(_ != null)))
>             ).flatten.reduceLeftOption(FilterApi.or)
>           } else {
>             makeInPredicate.lift(fieldType).map(_(fieldNames, values))
>           }
>         } else {
>           None
>         }
>
> I see that when the number of items is less than or equal to
> spark.sql.parquet.pushdown.inFilterThreshold, ParquetFilters.scala pushes
> down a chain of ORs rather than an IN predicate. What are the advantages of
> doing so?
>
> Best Regards,
> Yian
>
