attilapiros commented on code in PR #50033:
URL: https://github.com/apache/spark/pull/50033#discussion_r1994459647
##########
core/src/main/scala/org/apache/spark/scheduler/ShuffleMapStage.scala:
##########
@@ -90,8 +90,11 @@ private[spark] class ShuffleMapStage(

   /** Returns the sequence of partition ids that are missing (i.e. needs to be computed). */
   override def findMissingPartitions(): Seq[Int] = {
-    mapOutputTrackerMaster
-      .findMissingPartitions(shuffleDep.shuffleId)
-      .getOrElse(0 until numPartitions)
+    if (this.areAllPartitionsMissing(this.latestInfo.attemptNumber())) {

Review Comment:
   I think it would be much better to unregister all the map and merge output when the stages are rolled back (for each rolled-back stage), as is done currently for the barrier RDDs:
   https://github.com/apache/spark/blob/1014c6babf57d08ba66bba3706bd350f4e9277dd/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L2116

   That way `areAllPartitionsMissing` and `attemptIdAllPartitionsMissing` won't be needed, as `mapOutputTrackerMaster.findMissingPartitions` will return all the partitions.

   And for the `ResultStage`, I wonder whether an `abortStage` would be sufficient.

##########
resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala:
##########
@@ -167,7 +167,10 @@ abstract class BaseYarnClusterSuite extends SparkFunSuite with Matchers {
       extraJars: Seq[String] = Nil,
       extraConf: Map[String, String] = Map(),
       extraEnv: Map[String, String] = Map(),
-      outFile: Option[File] = None): SparkAppHandle.State = {
+      outFile: Option[File] = None,
+      testTimeOut: Int = 3, // minutes
+      timeOutIntervalCheck: Int = 1 // seconds

Review Comment:
   Why `Int`? What about `Duration`?

##########
resource-managers/yarn/pom.xml:
##########
@@ -37,6 +37,11 @@
       <artifactId>spark-core_${scala.binary.version}</artifactId>
       <version>${project.version}</version>
     </dependency>
+    <dependency>
+      <groupId>org.apache.spark</groupId>
+      <artifactId>spark-sql_${scala.binary.version}</artifactId>
+      <version>${project.version}</version>
+    </dependency>

Review Comment:
   Please do not depend on `spark-sql` in `spark-yarn`, especially not in compile scope. I would prefer to have the test changed to use only RDDs.

##########
scalastyle-config.xml:
##########
@@ -94,7 +94,7 @@ This file is divided into 3 sections:
   </check>

   <check customId="argcount" level="error" class="org.scalastyle.scalariform.ParameterNumberChecker" enabled="true">
-    <parameters><parameter name="maxParameters"><![CDATA[10]]></parameter></parameters>
+    <parameters><parameter name="maxParameters"><![CDATA[11]]></parameter></parameters>

Review Comment:
   Please revert this, as this decision affects all future source code and requires a bigger discussion. I suggest using:
   ```
   // scalastyle:off argcount
   // scalastyle:on
   ```
   Or a case class containing both values, as they are related.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
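Taking the last two review comments together (prefer `Duration` over `Int`, and wrap the two related timeout values in a case class so the test helper stays under the argument-count limit), a minimal sketch could look like the following. `TestTimeouts` and `awaitFinalState` are hypothetical names used only for illustration; they are not existing `BaseYarnClusterSuite` members.

```scala
import scala.concurrent.duration._

// Hypothetical: groups the two related timeout settings into one value,
// typed as Duration rather than bare Int minutes/seconds.
final case class TestTimeouts(
    total: FiniteDuration = 3.minutes,
    checkInterval: FiniteDuration = 1.second)

object TimeoutSketch {
  // Polls `isFinished` every `timeouts.checkInterval` until it returns true
  // or `timeouts.total` elapses; returns whether the condition was met in time.
  def awaitFinalState(
      isFinished: () => Boolean,
      timeouts: TestTimeouts = TestTimeouts()): Boolean = {
    val deadline = timeouts.total.fromNow
    while (deadline.hasTimeLeft()) {
      if (isFinished()) return true
      Thread.sleep(timeouts.checkInterval.toMillis)
    }
    false
  }
}
```

Passing a single `TestTimeouts` parameter would also sidestep the `maxParameters` change to scalastyle-config.xml, since the helper being modified would gain one argument instead of two.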