adrians commented on PR #50170:
URL: https://github.com/apache/spark/pull/50170#issuecomment-2735935628

   This change's effect ranges from "big performance gains" to "same level of 
performance", depending on which other optimizations the rewritten predicate 
enables.
   
   The cases with big performance gains:
   * partitioned tables, where the predicate uses the partitioning/bucketing 
column: `array_contains(..., col)` => instead of doing a full-table-scan, a 
partition-pruning step is introduced (as a result of the predicate pushdown); 
see the repro sketch after this list.
   * file-formats with row-group statistics (as seen in the Parquet & ORC 
benchmarks) => instead of doing a full-table-scan, some row-groups are 
discarded as part of the pushdown.
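
   A minimal, hypothetical repro of the partition-pruning case (the path, column 
names and literal values below are made up for illustration, and it assumes a 
running `SparkSession` named `spark`):

   ```scala
   import org.apache.spark.sql.functions.col

   // Write a small table partitioned by `part`.
   spark.range(0, 1000000)
     .withColumn("part", (col("id") % 100).cast("int"))
     .write.mode("overwrite").partitionBy("part").parquet("/tmp/array_contains_demo")

   spark.read.parquet("/tmp/array_contains_demo").createOrReplaceTempView("tbl")

   // Without the rewrite, the literal-array predicate is opaque to the optimizer and
   // every partition directory is scanned; once it is rewritten to IN (or written as
   // IN directly), only the matching partitions are read.
   spark.sql("SELECT * FROM tbl WHERE array_contains(array(1, 3, 5, 7), part)").explain()
   spark.sql("SELECT * FROM tbl WHERE part IN (1, 3, 5, 7)").explain()
   ```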
   
   The cases with same level of performance:
   * dealing with file-formats with no row-group statistics (nothing to skip).
   * dealing with queries where the partitioning column is not involved in the 
`array_contains()` predicate.
   
   The "root cause" for this pull-request is that a query like `SELECT * FROM 
tbl WHERE col1 IN (1,3,5,7)` allows for a lot of optimizations, while `SELECT * 
FROM tbl WHERE array_contains(array(1,3,5,7), col1)` triggers full-table-scans 
(doesn't use row-group statistics, partition pruning).
   
   Of course, `array_contains` is more flexible (it allows for queries like 
`SELECT * FROM tbl WHERE array_contains(from_json(:payload, :schema), col1)`, 
where the number of elements in the array is not known at parsing time), but my 
point is that if the array argument of the `ArrayContains` node resolves to a 
list of literals (possibly as a result of constant-folding steps), it is safe 
to convert it to the equivalent `col1 IN (val1, val2, val3)` expression; a 
minimal sketch of such a rewrite is below.
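
   For illustration, here is a minimal sketch of the kind of rewrite I have in 
mind, written as a standalone Catalyst rule. The rule name is made up, this is 
not the exact code in this PR, and the pattern only covers the 
`CreateArray`-of-foldable-elements shape (it assumes the current two-argument 
`CreateArray` constructor):

   ```scala
   import org.apache.spark.sql.catalyst.expressions.{ArrayContains, CreateArray, In}
   import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
   import org.apache.spark.sql.catalyst.rules.Rule

   // Hypothetical rule name; the real rule in this PR may be structured differently.
   object RewriteArrayContainsAsIn extends Rule[LogicalPlan] {
     override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
       // `array_contains(array(v1, v2, ...), col)` where every element is foldable
       // (a literal, or folded to one by ConstantFolding) becomes `col IN (v1, v2, ...)`,
       // which the existing pushdown/pruning rules already know how to exploit.
       case ArrayContains(CreateArray(elements, _), value)
           if elements.nonEmpty && elements.forall(_.foldable) =>
         In(value, elements)
     }
   }
   ```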

