Re: [PR] [SPARK-50994][CORE] Perform RDD conversion under tracked execution [spark]

via GitHub Sat, 22 Feb 2025 02:00:55 -0800


BOOTMGR commented on code in PR #49678:
URL: https://github.com/apache/spark/pull/49678#discussion_r1966485874



##########
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala:
##########
@@ -2721,6 +2721,25 @@ class DataFrameSuite extends QueryTest
       parameters = Map("name" -> ".whatever")
     )
   }
+
+  test("SPARK-50994: RDD conversion is performed with execution context") {
+    withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") {

Review Comment:
   @cloud-fan I took a close look at 
https://github.com/apache/spark/pull/48325 and I see that it takes a stab at a 
bigger problem: `SQLConf` values are not propagated when the RDD is actually 
executed (when the iterator is consumed), because that is triggered on demand 
by the user. This PR only ensures that the RDD gets the correct `SQLConf` when 
it is computed, not during iterator traversal.
   
   I followed the conversation there and I agree with you that all `SQLConf` 
accesses should happen during RDD computation (by storing the configs 
locally) rather than when the iterator is called. I also agree with 
@bersprockets's view that fixing it everywhere would be troublesome and that 
there is no guarantee for future additions. I believe that change needs some 
bigger considerations, like how we see interoperability between Dataset and 
RDD. I am ready to volunteer there.
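
   To make the distinction concrete, here is a toy, Spark-free sketch of the 
two read patterns. `ToyConf`, `lazyRead` and `eagerCapture` are hypothetical 
names made up for illustration; this does not use or mirror the real `SQLConf` 
implementation, it only models a scoped config with `scala.util.DynamicVariable`.

```scala
import scala.util.DynamicVariable

// Hypothetical stand-in for a scoped config; NOT Spark's SQLConf.
object ToyConf {
  private val caseSensitive = new DynamicVariable[Boolean](false)
  def isCaseSensitive: Boolean = caseSensitive.value
  // Sets the flag only for the duration of `body`.
  def withCaseSensitive[T](v: Boolean)(body: => T): T =
    caseSensitive.withValue(v)(body)
}

object ConfCapture {
  // Reads the config per element, i.e. during iterator traversal.
  def lazyRead(data: Seq[String]): Iterator[String] =
    data.iterator.map(s => if (ToyConf.isCaseSensitive) s else s.toLowerCase)

  // Reads the config once while the iterator is built and stores it locally.
  def eagerCapture(data: Seq[String]): Iterator[String] = {
    val cs = ToyConf.isCaseSensitive
    data.iterator.map(s => if (cs) s else s.toLowerCase)
  }

  def main(args: Array[String]): Unit = {
    val data = Seq("A", "b")
    // Both iterators are created while the flag is set to true...
    val (lazyIt, eagerIt) = ToyConf.withCaseSensitive(true) {
      (lazyRead(data), eagerCapture(data))
    }
    // ...but they are consumed after the scope has ended.
    println(lazyIt.toList)  // List(a, b): the flag was read too late
    println(eagerIt.toList) // List(A, b): the captured value is used
  }
}
```

   In this model, the lazy variant silently falls back to the default once 
traversal happens outside the scoped block, which is the kind of problem the 
other PR is after; the eager variant only needs the right value at 
construction time.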
   
   However, I feel this change should ship independently because:
   1. We need the correct configs to be set when RDD computation happens. This 
is needed regardless of https://github.com/apache/spark/pull/48325, which we 
can wait for separately.
   2. We need tracking on the Spark UI for stages submitted during RDD 
computation. For example, Snowflake's official Spark connector internally 
converts a DF to an RDD to serialise it into CSV format. Because of this, none 
of the dependent stages are shown on the Spark UI.
   
   Let me know what you think.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

