[PR] [Spark-50873][SQL] Prune column after RewriteSubquery rule for DSV2 [spark]

via GitHub Wed, 26 Mar 2025 00:54:19 -0700


Akeron-Zhu opened a new pull request, #50399:
URL: https://github.com/apache/spark/pull/50399


   ### What changes were proposed in this pull request?
   This PR adds an optimizer rule to SparkOptimizer that prunes unnecessary 
columns for DataSourceV2 (DSV2) after RewriteSubquery. 
   Spark 3 uses the V2ScanRelationPushDown rule to prune columns for DSV2. 
However, if the SQL query contains subqueries, the RewriteSubquery rule, which 
runs after V2ScanRelationPushDown, generates new predicates that could be used 
to prune more columns. Spark does not prune columns again at that point, which 
results in lower performance.
   See the issue for a more detailed description: 
https://issues.apache.org/jira/browse/SPARK-50873
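
As a rough illustration of the ordering problem described above, here is a toy model in Python. It is not Spark's optimizer code: the plan classes (`Scan`, `Project`, `ExistsFilter`, `SemiJoin`) and the functions `prune`, `rewrite_subquery`, and `scan_columns` are hypothetical names invented for this sketch, and the column names are only illustrative. It assumes, as a simplification, that the first pruning pass cannot tell which subquery columns are needed until the subquery has been rewritten into a join.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str
    columns: tuple  # columns the data source is asked to read


@dataclass(frozen=True)
class Project:
    columns: tuple
    child: object


@dataclass(frozen=True)
class ExistsFilter:
    child: object     # outer query
    subquery: object  # correlated EXISTS subquery, opaque to pruning
    outer_key: str
    inner_key: str


@dataclass(frozen=True)
class SemiJoin:
    left: object
    right: object
    left_key: str
    right_key: str


def prune(plan, required):
    """Narrow every reachable Scan to the columns its ancestors need."""
    if isinstance(plan, Scan):
        return Scan(plan.table, tuple(c for c in plan.columns if c in required))
    if isinstance(plan, Project):
        return Project(plan.columns, prune(plan.child, set(plan.columns)))
    if isinstance(plan, ExistsFilter):
        # Toy assumption: the subquery is left untouched by this pass.
        child = prune(plan.child, required | {plan.outer_key})
        return ExistsFilter(child, plan.subquery, plan.outer_key, plan.inner_key)
    if isinstance(plan, SemiJoin):
        # A semi join only needs the join key from its right side.
        left = prune(plan.left, required | {plan.left_key})
        right = prune(plan.right, {plan.right_key})
        return SemiJoin(left, right, plan.left_key, plan.right_key)
    raise TypeError(f"unknown plan node: {plan!r}")


def rewrite_subquery(plan):
    """Turn each ExistsFilter into a SemiJoin, leaving scans as they are."""
    if isinstance(plan, Scan):
        return plan
    if isinstance(plan, Project):
        return Project(plan.columns, rewrite_subquery(plan.child))
    if isinstance(plan, ExistsFilter):
        return SemiJoin(rewrite_subquery(plan.child), rewrite_subquery(plan.subquery),
                        plan.outer_key, plan.inner_key)
    if isinstance(plan, SemiJoin):
        return SemiJoin(rewrite_subquery(plan.left), rewrite_subquery(plan.right),
                        plan.left_key, plan.right_key)
    raise TypeError(f"unknown plan node: {plan!r}")


def scan_columns(plan):
    """Collect {table: columns} for every Scan in the plan."""
    if isinstance(plan, Scan):
        return {plan.table: plan.columns}
    found = {}
    for name in ("child", "subquery", "left", "right"):
        node = getattr(plan, name, None)
        if node is not None:
            found.update(scan_columns(node))
    return found


ALL = ("cs_order_number", "cs_warehouse_sk", "cs_ship_date_sk", "cs_ext_ship_cost")
plan = Project(
    ("cs_order_number",),
    ExistsFilter(Scan("cs1", ALL), Scan("cs2", ALL), "cs_order_number", "cs_order_number"),
)

first = prune(plan, set())           # pruning before the subquery rewrite
rewritten = rewrite_subquery(first)  # subquery becomes a semi join
second = prune(rewritten, set())     # the extra pruning pass after the rewrite

print(scan_columns(first)["cs2"])
print(scan_columns(second)["cs2"])
```

In this sketch the `cs2` scan still reads all four columns after the first pass, and only the second pass, run after the rewrite, narrows it to the join key. That mirrors the motivation stated above, under the toy assumptions listed.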
   
   ### Why are the changes needed?
   Better performance for Spark DSV2. 
   For example, in a 10 TB TPC-DS test, the execution time of query16 is 
reduced by about 50%, from 2.5 min to 1.3 min, on my cluster.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
    GitHub Actions.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


