As far as I understand, driver-side updates of custom accumulators happen during task completion [1]. The documentation states [2] that the very last stage in a job consists of multiple ResultTasks, each of which executes the task and sends its output back to the driver application. The sources also show [3] that accumulator updates are applied just once per ResultTask.
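To make the accounting described above concrete, here is a minimal pure-Python sketch (no Spark dependency; all names are illustrative, not Spark's actual API). It contrasts naive accounting, where every task (re-)execution bumps the accumulator and a retried task therefore over-counts, with driver-side accounting in the style of DAGScheduler, which merges a task's updates only the first time that task completes:

```python
class Accumulator:
    """Toy stand-in for a Spark accumulator."""
    def __init__(self):
        self.value = 0

    def add(self, v):
        self.value += v


def run_task(acc, partition):
    # "Transformation" side effect: one update per element processed.
    for x in partition:
        acc.add(1)
    return sum(partition)


# Naive accounting: every (re-)execution updates the accumulator,
# so a speculative or retried attempt of the same task double-counts.
naive = Accumulator()
run_task(naive, [1, 2, 3])
run_task(naive, [1, 2, 3])   # second attempt of the same task
print(naive.value)           # 6, although the partition has only 3 elements

# Driver-side accounting in the style of DAGScheduler for ResultTasks:
# merge a task's local accumulator updates only on its first successful
# completion, keyed by the task's partition id.
driver_acc = Accumulator()
completed = set()

def on_task_completion(task_id, local_update):
    if task_id not in completed:
        completed.add(task_id)
        driver_acc.add(local_update)

# Two attempts of the same ResultTask report their local counts.
on_task_completion(0, 3)
on_task_completion(0, 3)     # duplicate completion is ignored
print(driver_acc.value)      # 3: the update is applied exactly once
```

This is only a simulation of the bookkeeping, of course, not a claim about every code path in the scheduler.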
So it seems that accumulators are safe to use and will be updated only once, regardless of whether they are used in transformations or actions, as long as those transformations and actions belong to the last stage. Is that correct? Could any of the Spark committers or contributors please confirm it?

[1] https://github.com/apache/spark/blob/3990daaf3b6ca2c5a9f7790030096262efb12cb2/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1194
[2] https://github.com/apache/spark/blob/a6fc300e91273230e7134ac6db95ccb4436c6f8f/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L36
[3] https://github.com/apache/spark/blob/3990daaf3b6ca2c5a9f7790030096262efb12cb2/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1204

On Thu, May 10, 2018 at 10:24 PM, Sergey Zhemzhitsky <szh.s...@gmail.com> wrote:
> Hi there,
>
> Although Spark's docs state that there is a guarantee that
> - accumulators in actions will only be updated once
> - accumulators in transformations may be updated multiple times
>
> ... I'm wondering whether the same is true for transformations in the
> last stage of the job, or whether there is a guarantee that accumulators
> updated in transformations of the last stage are updated only once too?

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org