adrians commented on PR #50170: URL: https://github.com/apache/spark/pull/50170#issuecomment-2734739308
I've added a benchmark test case in `FilterPushdownBenchmark.scala`, mostly by copy-pasting the existing `InSet` test case.

One run *without* the ArrayContains-to-InSet rule ([full logs](https://github.com/adrians/spark/actions/runs/13930126240/job/38985730088)). The arrayContains predicate is applied after a full scan of the column and cannot be pushed down, even when the `spark.sql.parquet.filterPushdown` flag is enabled. Snippet below:

```
OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure
AMD EPYC 7763 64-Core Processor
ArrayContains -> InFilters (values count: 10, distribution: 90):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                         6686           6708          18          2.4         425.1       1.0X
Parquet Vectorized (Pushdown)                                              6726           6779          64          2.3         427.6       1.0X
Native ORC Vectorized                                                      5051           5063           9          3.1         321.2       1.3X
Native ORC Vectorized (Pushdown)                                           5153           5159          11          3.1         327.6       1.3X
```

One run *with* the ArrayContains-to-InSet rule ([full logs](https://github.com/adrians/spark/actions/runs/13925472778/job/38968725920)). The arrayContains predicate is transformed into inSet and, where possible, that predicate is pushed down to the data source (allowing pruning at page or row-group level).
Snippet below:

```
OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure
AMD EPYC 7763 64-Core Processor
ArrayContains -> InFilters (values count: 10, distribution: 90):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                         6714           6726          15          2.3         426.8       1.0X
Parquet Vectorized (Pushdown)                                               305            309           4         51.6          19.4      22.0X
Native ORC Vectorized                                                      4902           4918          10          3.2         311.6       1.4X
Native ORC Vectorized (Pushdown)                                            305            313           9         51.5          19.4      22.0X
```
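
For readers unfamiliar with why pushdown matters here, the following is a toy Python model of row-group pruning. It is not Spark or Parquet code. It assumes each row group carries min/max statistics for the column, and every name in it (`RowGroup`, `make_row_group`, `scan_without_pushdown`, `scan_with_pushdown`) is hypothetical. It only illustrates the idea that a set-membership predicate can skip row groups whose value range contains none of the candidates, whereas a predicate evaluated after the scan has to read every row.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class RowGroup:
    """Toy stand-in for a row group: column values plus min/max statistics."""
    values: List[int]
    lo: int
    hi: int


def make_row_group(values: List[int]) -> RowGroup:
    # Statistics are computed once, as a writer would do when creating the file.
    return RowGroup(values=values, lo=min(values), hi=max(values))


def scan_without_pushdown(groups: List[RowGroup],
                          candidates: List[int]) -> Tuple[List[int], int]:
    """Read every row, then test membership in the literal array."""
    matches, rows_read = [], 0
    for group in groups:
        for value in group.values:
            rows_read += 1
            if value in candidates:
                matches.append(value)
    return matches, rows_read


def scan_with_pushdown(groups: List[RowGroup],
                       candidates: Set[int]) -> Tuple[List[int], int]:
    """Skip row groups whose [lo, hi] range contains no candidate value."""
    matches, rows_read = [], 0
    for group in groups:
        if not any(group.lo <= c <= group.hi for c in candidates):
            continue  # pruned: no row in this group can match
        for value in group.values:
            rows_read += 1
            if value in candidates:
                matches.append(value)
    return matches, rows_read


# Ten row groups of 100 sorted values each; all candidates fall in the last one.
groups = [make_row_group(list(range(i * 100, (i + 1) * 100))) for i in range(10)]
literal_array = [905, 917, 950]

full_matches, full_rows = scan_without_pushdown(groups, literal_array)
pruned_matches, pruned_rows = scan_with_pushdown(groups, set(literal_array))

print(full_rows, pruned_rows)
```

Both scans return the same matches; only the number of rows read differs. How much a real scan benefits depends on how the data is laid out and on the statistics the file format actually stores.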