[PR] [SPARK-52339][SQL] Fix comparison of `InMemoryFileIndex` instances [spark]

via GitHub Wed, 28 May 2025 13:12:09 -0700


bersprockets opened a new pull request, #51043:
URL: https://github.com/apache/spark/pull/51043


   ### What changes were proposed in this pull request?
   
   This PR changes `InMemoryFileIndex#equals` to compare a non-distinct 
collection of root paths rather than a distinct set of root paths. Without this 
change, `InMemoryFileIndex#equals` considers the following two collections of 
root paths to be equal, even though they represent a different number of rows:
   ```
   ["/tmp/test", "/tmp/test"]
   ["/tmp/test", "/tmp/test", "/tmp/test"]
   ```
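   
   A minimal sketch of the difference, using plain Scala collections rather 
than Spark's actual `InMemoryFileIndex` code. The variable names are 
hypothetical, and a plain `Seq` comparison is only one way to keep duplicates 
(it is also order-sensitive); the exact comparison used by the patch is not 
shown here.
   ```scala
   // Plain Scala collections only; this is not Spark's InMemoryFileIndex code.
   val roots1 = Seq("/tmp/test", "/tmp/test")
   val roots2 = Seq("/tmp/test", "/tmp/test", "/tmp/test")
   
   // Comparing distinct sets drops duplicates, so the two lists look equal.
   val equalAsSets = roots1.toSet == roots2.toSet
   
   // Comparing the sequences themselves keeps duplicates, so they differ.
   val equalAsSeqs = roots1 == roots2
   ```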
   
   ### Why are the changes needed?
   
   The bug can cause correctness issues, e.g.
   ```
   // create test data
   val data = Seq((1, 2), (2, 3)).toDF("a", "b")
   data.write.mode("overwrite").csv("/tmp/test")
   
   val fileList1 = List.fill(2)("/tmp/test")
   val fileList2 = List.fill(3)("/tmp/test")
   
   val df1 = spark.read.schema("a int, b int").csv(fileList1: _*)
   val df2 = spark.read.schema("a int, b int").csv(fileList2: _*)
   
   df1.count() // correctly returns 4
   df2.count() // correctly returns 6
   
   // the following is the same as above, except df1 is persisted
   val df1 = spark.read.schema("a int, b int").csv(fileList1: _*).persist
   val df2 = spark.read.schema("a int, b int").csv(fileList2: _*)
   
   df1.count() // correctly returns 4
   df2.count() // incorrectly returns 4!!
   ```
   In the above example, df1 and df2 were created with a different number of 
paths: df1 has 2, and df2 has 3. But since the distinct set of root paths is 
the same (e.g., `Set("/tmp/test") == Set("/tmp/test")`), the two dataframes are 
considered equal. Thus, when df1 is persisted, df2 uses df1's cached plan.
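   
   The reuse can be illustrated with a toy model. `ToyIndex` and the `cache` 
map below are hypothetical stand-ins, not Spark's `InMemoryFileIndex` or its 
cache manager; the sketch only assumes that a cached result is looked up by 
equality and that each path contributes 2 rows, as in the example above.
   ```scala
   // Toy model only: ToyIndex and cache are hypothetical stand-ins,
   // not Spark's InMemoryFileIndex or its cache manager.
   final class ToyIndex(val rootPaths: Seq[String]) {
     // Mimics the buggy comparison: duplicate root paths are ignored.
     override def equals(other: Any): Boolean = other match {
       case that: ToyIndex => rootPaths.toSet == that.rootPaths.toSet
       case _ => false
     }
     override def hashCode(): Int = rootPaths.toSet.hashCode()
   }
   
   val rowsPerPath = 2L
   val idx1 = new ToyIndex(List.fill(2)("/tmp/test"))
   val idx2 = new ToyIndex(List.fill(3)("/tmp/test"))
   
   // "Persist" idx1 by storing its row count with the index as the key.
   val cache = Map(idx1 -> idx1.rootPaths.size * rowsPerPath)
   
   // The lookup for idx2 hits idx1's entry because the two compare equal,
   // so the fallback (3 paths * 2 rows) is never used.
   val countForIdx2 = cache.getOrElse(idx2, idx2.rootPaths.size * rowsPerPath)
   ```
   In this model `countForIdx2` is 4 rather than 6, mirroring the incorrect 
`df2.count()` above.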
   
   The same bug also causes inappropriate exchange reuse.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   New unit test.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

