Re: [PR] [SPARK-51756][CORE] Computes RowBasedChecksum in ShuffleWriters [spark]

via GitHub Sat, 12 Apr 2025 00:43:01 -0700


mridulm commented on code in PR #50230:
URL: https://github.com/apache/spark/pull/50230#discussion_r2040114948



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -5724,6 +5724,21 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val SHUFFLE_ORDER_INDEPENDENT_CHECKSUM_ENABLED =
+    buildConf("spark.shuffle.orderIndependentChecksum.enabled")
+      .doc("Whether to calculate order independent checksum for the shuffle data or not. If " +
+        "enabled, Spark will calculate a checksum that is independent of the input row order for " +
+        "each mapper and returns the checksums from executors to driver. Different from the above" +
+        "checksum, the order independent remains the same even if the shuffle row order changes. " +
+        "While the above checksum is sensitive to shuffle data ordering to detect file " +
+        "corruption. This checksum is used to detect whether different task attempts of the same " +
+        "partition produce different output data or not (same set of keyValue pairs). In case " +
+        "the output data has changed across retries, Spark will need to retry all tasks of the " +
+        "consumer stages to avoid correctness issues.")
+      .version("4.1.0")

Review Comment:
   @cloud-fan, it is too late for 4.0 - let us move it to 4.1
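
   For context on the config text quoted above, here is a minimal sketch of what "order independent" means for a checksum. It is an illustration only, not the PR's RowBasedChecksum implementation: the object name, the helper names, the use of CRC32 per row, and the choice of wrapping addition as the combiner are all assumptions made for this example.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.zip.CRC32

// Hypothetical sketch; names and the hashing scheme are assumptions, not Spark APIs.
object OrderIndependentChecksumSketch {

  // CRC32 of a single key/value pair, with a zero byte separating key and value.
  def rowChecksum(key: String, value: String): Long = {
    val crc = new CRC32()
    crc.update(key.getBytes(UTF_8))
    crc.update(0)
    crc.update(value.getBytes(UTF_8))
    crc.getValue
  }

  // Order independent: per-row checksums are combined with addition, which is
  // commutative (and wraps on Long overflow), so row order does not affect the result.
  def orderIndependent(rows: Seq[(String, String)]): Long =
    rows.foldLeft(0L) { case (acc, (k, v)) => acc + rowChecksum(k, v) }

  // Order sensitive: all rows are fed into one CRC32 stream, so reordering the
  // rows generally changes the result.
  def orderSensitive(rows: Seq[(String, String)]): Long = {
    val crc = new CRC32()
    rows.foreach { case (k, v) =>
      crc.update(k.getBytes(UTF_8))
      crc.update(0)
      crc.update(v.getBytes(UTF_8))
    }
    crc.getValue
  }
}
```

   With this sketch, two task attempts that emit the same set of key/value pairs in a different order get the same `orderIndependent` value, while `orderSensitive` would usually differ. Addition is used here rather than XOR so that duplicate rows do not cancel each other out.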



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
