adrians commented on PR #50170: URL: https://github.com/apache/spark/pull/50170#issuecomment-2735935628
This change's effect ranges from "big performance gains" to "same level of performance", depending on which other optimizations apply at the same time.

The cases with big performance gains:
* partitioned tables, where the predicate uses the partitioning/bucketing column: `array_contains(..., col)` => instead of a full table scan, a partition-pruning step is introduced (as a result of the predicate pushdown).
* file formats with row-group statistics (as seen in the Parquet & ORC benchmarks) => instead of a full table scan, some row groups are discarded as part of the pushdown.

The cases with the same level of performance:
* file formats with no row-group statistics (nothing to skip).
* queries where the partitioning column is not involved in the `array_contains()` predicate.

The "root cause" for this pull request is that a query like `SELECT * FROM tbl WHERE col1 IN (1,3,5,7)` allows for a lot of optimizations, while the equivalent `SELECT * FROM tbl WHERE array_contains(array(1,3,5,7), col1)` triggers a full table scan (it uses neither row-group statistics nor partition pruning). Of course, `array_contains` is more flexible (it allows queries like `SELECT * FROM tbl WHERE array_contains(from_json(:payload, :schema), col1)`, where the number of elements in the array is not known at parse time), but my point is that if the array argument of the `ArrayContains` node resolves to a list of literals (possibly as a result of constant-folding steps), it is safe to convert it to the equivalent `col1 IN (val1, val2, val3)` expression.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
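The rewrite described above can be sketched as a small Scala model. This is a toy illustration, not Spark's actual Catalyst API: the `Expr` hierarchy and the `ArrayContainsToIn.rewrite` function below are invented for the example, though the node names deliberately mirror Catalyst's `ArrayContains`, `CreateArray`, `Literal`, and `In`.

```scala
// Toy expression tree mirroring (but not reusing) Catalyst node names.
sealed trait Expr
case class Lit(value: Any) extends Expr
case class Col(name: String) extends Expr
case class CreateArray(elems: Seq[Expr]) extends Expr
case class ArrayContains(array: Expr, value: Expr) extends Expr
case class In(value: Expr, list: Seq[Expr]) extends Expr

object ArrayContainsToIn {
  // Rewrite array_contains(array(lit1, lit2, ...), col) into col IN (lit1, lit2, ...),
  // but only when every array element is a literal, so downstream IN-list
  // optimizations (partition pruning, row-group skipping) can apply.
  def rewrite(e: Expr): Expr = e match {
    case ArrayContains(CreateArray(elems), v) if elems.forall(_.isInstanceOf[Lit]) =>
      In(v, elems)
    // Leave non-literal arrays (e.g. the output of from_json) untouched.
    case other => other
  }
}

// Example: array_contains(array(1,3,5,7), col1) becomes col1 IN (1,3,5,7).
val before = ArrayContains(CreateArray(Seq(Lit(1), Lit(3), Lit(5), Lit(7))), Col("col1"))
val after  = ArrayContainsToIn.rewrite(before)
```

In a real Catalyst rule the match would also need to handle the case where constant folding has already collapsed `CreateArray` of literals into a single array-typed `Literal`, and the guard keeps the rewrite safe by refusing any array whose elements are not all known at optimization time.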