jackylee-ch opened a new pull request, #52458:
URL: https://github.com/apache/spark/pull/52458

   ### Why are the changes needed?
   While testing TPC-DS queries on DataSource V2, I noticed that column pruning 
does not happen for subqueries of the form `EXISTS (SELECT * FROM ... WHERE ...)`. 
As a result, the scan reads all columns instead of only the required ones. The 
issue is reproducible in queries such as Q10, Q16, Q35, Q69, and Q94.
   
   This PR inserts a `Project` into the `Subquery`, ensuring that only the 
referenced columns are read from DataSource V2.
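
   To illustrate the idea, here is a toy model in Python. It is not Spark's Catalyst code: the `Scan`, `Filter`, and `Project` classes and the helpers `prune` and `project_exists_subquery` are hypothetical names made up for this sketch. It assumes the subquery predicate references `id` (for example as a correlation key) and `col1`, which is consistent with the `ReadSchema` shown in the plans but is not stated explicitly in this description.

```python
from dataclasses import dataclass

# Toy logical-plan nodes. Illustrative stand-ins only, not Spark classes.
@dataclass(frozen=True)
class Scan:
    table: str
    columns: tuple          # columns the scan will read


@dataclass(frozen=True)
class Filter:
    references: frozenset   # columns used by the predicate
    child: object


@dataclass(frozen=True)
class Project:
    columns: tuple          # columns kept by the projection
    child: object


def prune(plan, required=None):
    """Push the set of required columns down to the scan.

    required=None means the parent needs every column (the SELECT * case).
    """
    if isinstance(plan, Project):
        # A Project narrows what its child has to produce.
        return Project(plan.columns, prune(plan.child, frozenset(plan.columns)))
    if isinstance(plan, Filter):
        child_required = None if required is None else required | plan.references
        return Filter(plan.references, prune(plan.child, child_required))
    if isinstance(plan, Scan):
        if required is None:
            return plan
        return Scan(plan.table, tuple(c for c in plan.columns if c in required))
    raise TypeError(f"unsupported node: {plan!r}")


def project_exists_subquery(subquery):
    """Wrap an EXISTS subquery in a Project over the columns its predicates use.

    EXISTS only checks whether a row is returned, so no other column is needed.
    """
    refs = set()
    node = subquery
    while not isinstance(node, Scan):
        if isinstance(node, Filter):
            refs |= node.references
        node = node.child
    return Project(tuple(c for c in node.columns if c in refs), subquery)


def scan_columns(plan):
    """Return the columns read by the scan at the bottom of the plan."""
    while not isinstance(plan, Scan):
        plan = plan.child
    return plan.columns


T1_COLUMNS = ("id",) + tuple(f"col{i}" for i in range(1, 10))

# Assumed subquery shape: SELECT * FROM t1 WHERE <predicate on id and col1>.
subquery = Filter(frozenset({"id", "col1"}), Scan("t1", T1_COLUMNS))

before = prune(subquery)                          # no Project: nothing is pruned
after = prune(project_exists_subquery(subquery))  # Project inserted first

print(scan_columns(before))
print(scan_columns(after))
```

   Without the `Project`, the pruning pass has no bound on the required columns, so the scan keeps all ten. With the `Project` in place, only `id` and `col1` reach the scan.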
   
   Below are the plan changes for this PR.
   Before this PR
   ```
   BatchScan parquet file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-76b1f4fc-2e84-485c-aade-a62168987baf/t1[id#32L, col1#33L, col2#34L, col3#35L, col4#36L, col5#37L, col6#38L, col7#39L, col8#40L, col9#41L] ParquetScan DataFilters: [isnotnull(col1#33L), (col1#33L > 5)], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-76..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [IsNotNull(col1), GreaterThan(col1,5)], PushedGroupBy: [], ReadSchema: struct<id:bigint,col1:bigint,col2:bigint,col3:bigint,col4:bigint,col5:bigint,col6:bigint,col7:big... RuntimeFilters: []
   ```
   After this PR
   ```
   BatchScan parquet file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-cd4b50d9-1643-40e6-a8e1-1429d3213411/t1[id#133L, col1#134L] ParquetScan DataFilters: [isnotnull(col1#134L), (col1#134L > 5)], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-cd..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [IsNotNull(col1), GreaterThan(col1,5)], PushedGroupBy: [], ReadSchema: struct<id:bigint,col1:bigint> RuntimeFilters: []
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   A newly added unit test.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

