[ https://issues.apache.org/jira/browse/SPARK-51831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17945263#comment-17945263 ]
Yang Jie commented on SPARK-51831:
----------------------------------

cc [~cloud_fan] FYI

> No Column Pruning while using ExistJoin and datasource v2
> ----------------------------------------------------------
>
>                  Key: SPARK-51831
>                  URL: https://issues.apache.org/jira/browse/SPARK-51831
>              Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.0.0, 3.5.5
>            Reporter: Junqing Li
>            Priority: Major
>
> Recently, I have been testing TPC-DS queries based on DataSource V2 and noticed that column pruning does not occur in scenarios involving {{EXISTS (SELECT * FROM ... WHERE ...)}}. As a result, the scan ends up reading all columns instead of just the required ones. The issue is reproducible in queries such as Q10, Q16, Q35, Q69, and Q94.
> The following simplified example demonstrates the problem.
> {code:java}
> test("Test exist join with v2 source plan") {
>   import org.apache.spark.sql.functions._
>   import org.apache.spark.sql.internal.SQLConf
>   withTempPath { dir =>
>     spark.range(100)
>       .withColumn("col1", col("id") + 1)
>       .withColumn("col2", col("id") + 2)
>       .withColumn("col3", col("id") + 3)
>       .withColumn("col4", col("id") + 4)
>       .withColumn("col5", col("id") + 5)
>       .withColumn("col6", col("id") + 6)
>       .withColumn("col7", col("id") + 7)
>       .withColumn("col8", col("id") + 8)
>       .withColumn("col9", col("id") + 9)
>       .write
>       .mode("overwrite")
>       .parquet(dir.getCanonicalPath + "/t1")
>     spark.range(10).write.mode("overwrite").parquet(dir.getCanonicalPath + "/t2")
>     Seq("parquet", "").foreach { v1SourceList =>
>       withSQLConf(SQLConf.USE_V1_SOURCE_LIST.key -> v1SourceList) {
>         spark.read.parquet(dir.getCanonicalPath + "/t1").createOrReplaceTempView("t1")
>         spark.read.parquet(dir.getCanonicalPath + "/t2").createOrReplaceTempView("t2")
>         spark.sql(
>           """
>             |select sum(t1.id) as sum_id
>             |from t1, t2
>             |where t1.id == t2.id
>             |  and exists(select * from t1 where t1.id == t2.id)
>             |""".stripMargin).explain()
>       }
>     }
>   }
> }
> {code}
> After execution, we can observe the following:
> * *With DataSource V1*, the query plan clearly shows column pruning: only one column is read during the scan.
> {code:java}
> BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]),false), [plan_id=90]
> +- FileScan parquet [id#32L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-e0..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
> {code}
> * *With DataSource V2*, the query plan reveals that no column pruning is applied: all columns are read from the underlying data source.
> {code:java}
> BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]),false), [plan_id=152]
> +- Project [id#58L]
>    +- BatchScan parquet file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-e0ad680b-573c-4b11-b1b5-f50ee38fa81a/t1[id#58L, col1#59L, col2#60L, col3#61L, col4#62L, col5#63L, col6#64L, col7#65L, col8#66L, col9#67L] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-e0..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [], PushedGroupBy: [], ReadSchema: struct<id:bigint,col1:bigint,col2:bigint,col3:bigint,col4:bigint,col5:bigint,col6:bigint,col7:big... RuntimeFilters: []
> {code}

-- 
This message was sent by Atlassian Jira
(v8.20.10#820010)
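For illustration, the pruning behavior the V2 plan above is missing can be sketched outside Spark: given the columns a scan exposes and the attributes the parent operators actually reference, a pruning step should narrow the read schema to the referenced subset (here, `id` out of ten columns). This is a standalone, hypothetical sketch with made-up `Scan`/`pruneColumns` names; it is not Spark's actual optimizer rule.

{code:java}
// Hypothetical standalone sketch (not Spark internals): a "scan" exposing some
// columns, and a pruning step that keeps only the columns referenced above it --
// the behavior the V2 BatchScan in this report fails to exhibit.
case class Scan(columns: Seq[String])

def pruneColumns(scan: Scan, referenced: Set[String]): Scan =
  Scan(scan.columns.filter(referenced.contains))

object PruningSketch {
  def main(args: Array[String]): Unit = {
    // t1 as in the reproduction: id plus col1..col9
    val t1 = Scan("id" +: (1 to 9).map(i => s"col$i"))
    // The join and the EXISTS subquery only reference t1.id
    val pruned = pruneColumns(t1, Set("id"))
    println(pruned.columns.mkString(","))  // prints "id"
  }
}
{code}

In the V1 plan this narrowing shows up as `ReadSchema: struct<id:bigint>`; in the V2 plan the `Project` sits above the `BatchScan` but the narrowed schema is never pushed into the scan itself.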