attilapiros commented on code in PR #50033:
URL: https://github.com/apache/spark/pull/50033#discussion_r1994459647
##########
core/src/main/scala/org/apache/spark/scheduler/ShuffleMapStage.scala:
##########
@@ -90,8 +90,11 @@ private[spark] class ShuffleMapStage(

   /** Returns the sequence of partition ids that are missing (i.e. needs to be computed). */
   override def findMissingPartitions(): Seq[Int] = {
-    mapOutputTrackerMaster
-      .findMissingPartitions(shuffleDep.shuffleId)
-      .getOrElse(0 until numPartitions)
+    if (this.areAllPartitionsMissing(this.latestInfo.attemptNumber())) {

Review Comment:
   I think it would be much better to unregister all the map and merge output when the stages are rolled back (for each rolled-back stage), as is done currently for the barrier RDDs:
   https://github.com/apache/spark/blob/1014c6babf57d08ba66bba3706bd350f4e9277dd/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L2116

   That way `areAllPartitionsMissing` and `attemptIdAllPartitionsMissing` won't be needed, as `mapOutputTrackerMaster.findMissingPartitions` will return all the partitions.

   And for the `ResultStage`, I wonder whether an `abortStage` would be sufficient.

##########
resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala:
##########
@@ -167,7 +167,10 @@ abstract class BaseYarnClusterSuite extends SparkFunSuite with Matchers {
       extraJars: Seq[String] = Nil,
       extraConf: Map[String, String] = Map(),
       extraEnv: Map[String, String] = Map(),
-      outFile: Option[File] = None): SparkAppHandle.State = {
+      outFile: Option[File] = None,
+      testTimeOut: Int = 3, // minutes
+      timeOutIntervalCheck: Int = 1 // seconds

Review Comment:
   Why `Int`? What about `Duration`?

##########
resource-managers/yarn/pom.xml:
##########
@@ -37,6 +37,11 @@
       <artifactId>spark-core_${scala.binary.version}</artifactId>
       <version>${project.version}</version>
     </dependency>
+    <dependency>
+      <groupId>org.apache.spark</groupId>
+      <artifactId>spark-sql_${scala.binary.version}</artifactId>
+      <version>${project.version}</version>
+    </dependency>

Review Comment:
   Please do not depend on `spark-sql` in `spark-yarn`, especially not in compile scope. I would prefer to have the test changed to use only RDDs.

##########
scalastyle-config.xml:
##########
@@ -94,7 +94,7 @@ This file is divided into 3 sections:
   </check>

   <check customId="argcount" level="error" class="org.scalastyle.scalariform.ParameterNumberChecker" enabled="true">
-    <parameters><parameter name="maxParameters"><![CDATA[10]]></parameter></parameters>
+    <parameters><parameter name="maxParameters"><![CDATA[11]]></parameter></parameters>

Review Comment:
   Please revert this, as this decision affects all future source code and requires a bigger discussion. I suggest using:
   ```
   // scalastyle:off argcount
   // scalastyle:on
   ```
   Or a case class containing both values, as they are related.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
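Taking the last two review comments together (prefer `Duration` over `Int`, and wrap the two related timeout values in a case class so the test helper stays under the argument-count limit), a minimal sketch could look like the following. `TestTimeouts` and `awaitFinalState` are hypothetical names used only for illustration; they are not existing `BaseYarnClusterSuite` members.

```scala
import scala.concurrent.duration._

// Hypothetical: groups the two related timeout settings into one value,
// typed as Duration rather than bare Int minutes/seconds.
final case class TestTimeouts(
    total: FiniteDuration = 3.minutes,
    checkInterval: FiniteDuration = 1.second)

object TimeoutSketch {
  // Polls `isFinished` every `timeouts.checkInterval` until it returns true
  // or `timeouts.total` elapses; returns whether the condition was met in time.
  def awaitFinalState(
      isFinished: () => Boolean,
      timeouts: TestTimeouts = TestTimeouts()): Boolean = {
    val deadline = timeouts.total.fromNow
    while (deadline.hasTimeLeft()) {
      if (isFinished()) return true
      Thread.sleep(timeouts.checkInterval.toMillis)
    }
    false
  }
}
```

Passing a single `TestTimeouts` parameter would also sidestep the `maxParameters` change to scalastyle-config.xml, since the helper being modified would gain one argument instead of two.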