Akeron-Zhu opened a new pull request, #50399: URL: https://github.com/apache/spark/pull/50399
### What changes were proposed in this pull request?

This PR adds an optimizer rule to SparkOptimizer that prunes unnecessary columns for DataSourceV2 (DSV2) after RewriteSubquery runs. Spark 3 uses the V2ScanRelationPushDown rule to prune columns for DSV2. However, if the SQL query contains subqueries, the RewriteSubquery rule, which runs after V2ScanRelationPushDown, generates new predicates that could be used to prune more columns. Spark does not prune columns again at that point, which lowers performance. See the issue for a more detailed description: https://issues.apache.org/jira/browse/SPARK-50873

### Why are the changes needed?

Better performance for Spark DSV2. For example, in a 10T TPCDS test, the execution time of query16 is reduced by about 50%, from 2.5 min to 1.3 min, on my cluster.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GitHub Actions.

### Was this patch authored or co-authored using generative AI tooling?

No.