AFAICT, `pendingPartitions` is not a must-have: `mapOutputTrackerMaster` was added by a later change, so `pendingPartitions` can be cleaned up.
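
To make the overlap concrete, here is a minimal toy sketch in Scala. It is not Spark code: `ToyMapOutputTracker`, `ToyShuffleMapStage`, `removeOutputs` and the executor-loss scenario are hypothetical stand-ins that only model the two kinds of bookkeeping discussed in this thread (a per-stage `pendingPartitions` set versus a missing-partition list derived from the tracker).

```scala
import scala.collection.mutable

// Hypothetical stand-in for MapOutputTrackerMaster: remembers which map
// partitions currently have a registered shuffle output.
final class ToyMapOutputTracker(numPartitions: Int) {
  private val available = mutable.BitSet.empty

  def registerMapOutput(partition: Int): Unit = available += partition

  // Models outputs disappearing, e.g. because an executor was lost.
  def removeOutputs(partitions: Set[Int]): Unit = available --= partitions

  def findMissingPartitions(): Seq[Int] =
    (0 until numPartitions).filterNot(available.contains)
}

// Hypothetical stand-in for ShuffleMapStage that keeps the extra registry.
final class ToyShuffleMapStage(val numPartitions: Int, tracker: ToyMapOutputTracker) {
  val pendingPartitions: mutable.HashSet[Int] = mutable.HashSet.empty[Int]

  // Both answers are derived from the tracker, not from pendingPartitions.
  def findMissingPartitions(): Seq[Int] = tracker.findMissingPartitions()
  def isAvailable: Boolean = findMissingPartitions().isEmpty
}

object PendingPartitionsSketch {
  // Returns (missing partitions per tracker, pendingPartitions, isAvailable).
  def demo(): (Seq[Int], Set[Int], Boolean) = {
    val tracker = new ToyMapOutputTracker(numPartitions = 3)
    val stage = new ToyShuffleMapStage(3, tracker)

    // Stage submitted: every partition is pending.
    stage.pendingPartitions ++= Seq(0, 1, 2)

    // Tasks for partitions 0 and 1 succeed: both registries are updated.
    Seq(0, 1).foreach { p =>
      stage.pendingPartitions -= p
      tracker.registerMapOutput(p)
    }

    // Assumed scenario: the output of partition 0 is lost afterwards and
    // only the tracker is told about it.
    tracker.removeOutputs(Set(0))

    (stage.findMissingPartitions(), stage.pendingPartitions.toSet, stage.isAvailable)
  }
}
```

In this toy model the tracker-derived view reports partitions 0 and 2 as missing, while `pendingPartitions` only contains 2. Whether the real DAGScheduler can ever get into such a state depends on how it handles executor loss, which this sketch does not attempt to reproduce.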

--
Cheers,
-z

On Sun, 26 Apr 2020 11:53:09 +0200
Jacek Laskowski <ja...@japila.pl> wrote:

> Hi,
>
> I found that ShuffleMapStage has this (apparently superfluous)
> pendingPartitions registry [1] for DAGScheduler and the description says:
>
> "  /**
>    * Partitions that either haven't yet been computed, or that were computed on an executor
>    * that has since been lost, so should be re-computed. This variable is used by the
>    * DAGScheduler to determine when a stage has completed. Task successes in both the active
>    * attempt for the stage or in earlier attempts for this stage can cause paritition ids to get
>    * removed from pendingPartitions. As a result, this variable may be inconsistent with the pending
>    * tasks in the TaskSetManager for the active attempt for the stage (the partitions stored here
>    * will always be a subset of the partitions that the TaskSetManager thinks are pending).
>    */
> "
>
> I'm curious why there is a need for this pendingPartitions
> since isAvailable or findMissingPartitions (using MapOutputTrackerMaster)
> know it already and I think are even more up-to-date. Why is there this
> extra registry?
>
> [1]
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapStage.scala#L60
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://about.me/JacekLaskowski
> "The Internals Of" Online Books <https://books.japila.pl/>
> Follow me on https://twitter.com/jaceklaskowski