adrians commented on PR #50170:
URL: https://github.com/apache/spark/pull/50170#issuecomment-2734739308

   I've added a benchmark test case to `FilterPushdownBenchmark.scala`, mostly by copy-pasting the existing `InSet` test case.
   
   One run *without* the ArrayContains-to-InSet rule ([full logs](https://github.com/adrians/spark/actions/runs/13930126240/job/38985730088)). The ArrayContains predicate is applied only after a full scan of the column; it cannot be pushed down, even when the `spark.sql.parquet.filterPushdown` flag is enabled. Snippet below:
   
   ```
   OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure
   AMD EPYC 7763 64-Core Processor
   ArrayContains -> InFilters (values count: 10, distribution: 90):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------------------------------
   Parquet Vectorized                                                         6686           6708          18          2.4         425.1       1.0X
   Parquet Vectorized (Pushdown)                                              6726           6779          64          2.3         427.6       1.0X
   Native ORC Vectorized                                                      5051           5063           9          3.1         321.2       1.3X
   Native ORC Vectorized (Pushdown)                                           5153           5159          11          3.1         327.6       1.3X
   ```
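
   As a reading aid for the table above, here is a minimal sketch of how the `Relative` column and the row count can be recovered from the other columns. It assumes that `Relative` is the first row's best time divided by each row's best time, and that `Per Row(ns)` is best time divided by row count; the object and function names are made up for illustration.

   ```scala
   object ReadBenchmarkTable {
     // Best Time(ms) values copied from the run without the rule.
     val bestMs: Seq[(String, Double)] = Seq(
       "Parquet Vectorized" -> 6686.0,
       "Parquet Vectorized (Pushdown)" -> 6726.0,
       "Native ORC Vectorized" -> 5051.0,
       "Native ORC Vectorized (Pushdown)" -> 5153.0
     )

     // Assumed formula: baseline best time divided by this row's best time.
     def relative(baselineMs: Double, ms: Double): Double = baselineMs / ms

     // Row count implied by Best Time(ms) and Per Row(ns): ms -> ns, then divide.
     def impliedRows(bestTimeMs: Double, perRowNs: Double): Double =
       bestTimeMs * 1e6 / perRowNs

     def main(args: Array[String]): Unit = {
       val baseline = bestMs.head._2
       bestMs.foreach { case (name, ms) =>
         println(f"$name%-34s ${relative(baseline, ms)}%.1fX")
       }
       println(f"implied rows: ${impliedRows(6686.0, 425.1) / 1e6}%.1f million")
     }
   }
   ```

   Under those assumptions, the two Parquet rows differ by under 1%, which is consistent with the filter not reaching the data source in this run.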
   
   One run *with* the ArrayContains-to-InSet rule ([full logs](https://github.com/adrians/spark/actions/runs/13925472778/job/38968725920)). The ArrayContains predicate is rewritten into InSet and, where possible, pushed down to the data source (allowing pruning at page or row-group level). Snippet below:
   
   ```
   OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure
   AMD EPYC 7763 64-Core Processor
   ArrayContains -> InFilters (values count: 10, distribution: 90):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------------------------------
   Parquet Vectorized                                                         6714           6726          15          2.3         426.8       1.0X
   Parquet Vectorized (Pushdown)                                               305            309           4         51.6          19.4      22.0X
   Native ORC Vectorized                                                      4902           4918          10          3.2         311.6       1.4X
   Native ORC Vectorized (Pushdown)                                            305            313           9         51.5          19.4      22.0X
   ```
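
   For anyone who wants to look at the two predicate shapes locally, here is a minimal sketch using only the public DataFrame API. It is not the benchmark code: the object name, output path, row count, and value list are placeholders chosen for illustration. It writes a small Parquet file, filters it once with `array_contains` over a literal array and once with the equivalent `isin`, and prints both physical plans so the `PushedFilters` entry of each scan can be compared.

   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions.{array, array_contains, col, lit}

   object ArrayContainsVsIsin {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .master("local[*]")
         .appName("array-contains-vs-isin")
         .config("spark.sql.parquet.filterPushdown", "true")
         .getOrCreate()

       // Placeholder path and size, not taken from the benchmark.
       val path = "/tmp/array_contains_demo.parquet"
       spark.range(0L, 1000000L).write.mode("overwrite").parquet(path)
       val df = spark.read.parquet(path)

       val wanted = Seq(10L, 20L, 30L)

       // Shape 1: membership test written as array_contains over a literal array.
       val viaArrayContains =
         df.filter(array_contains(array(wanted.map(v => lit(v)): _*), col("id")))

       // Shape 2: the equivalent IN-style predicate.
       val viaIsin = df.filter(col("id").isin(wanted: _*))

       // Compare the PushedFilters entry of the Parquet scan in each plan.
       viaArrayContains.explain()
       viaIsin.explain()

       // Both shapes must select the same rows.
       assert(viaArrayContains.count() == viaIsin.count())

       spark.stop()
     }
   }
   ```

   What the plans show depends on the Spark build: on one that includes the rule from this PR, the two plans would be expected to look alike, while without it only the `isin` form should list a pushed filter.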


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

