Hi Matei,

Can you give some hints on how the current implementation guarantees that an accumulator update is applied only once?
There is a pending PR trying to achieve this (https://github.com/apache/spark/pull/228/files), but in the current implementation I didn't see that this has been done. (Maybe I missed something.)

Best,

--
Nan Zhu

On Sunday, September 21, 2014 at 1:10 AM, Matei Zaharia wrote:

> Hey Sandy,
>
> On September 20, 2014 at 8:50:54 AM, Sandy Ryza (sandy.r...@cloudera.com) wrote:
>
> > Hey All,
> >
> > A couple questions came up about shared variables recently, and I wanted to
> > confirm my understanding and update the doc to be a little more clear.
> >
> > *Broadcast variables*
> > Now that task data is automatically broadcast, the only occasions where it
> > makes sense to explicitly broadcast are:
> > * You want to use a variable from tasks in multiple stages.
> > * You want to have the variable stored on the executors in deserialized
> >   form.
> > * You want tasks to be able to modify the variable and have those
> >   modifications take effect for other tasks running on the same executor
> >   (usually a very bad idea).
> >
> > Is that right?
>
> Yeah, pretty much. Reason 1 above is probably the biggest, but 2 also
> matters. (We might later factor tasks in a different way to avoid 2, but it's
> hard due to things like Hadoop JobConf objects in the tasks).
>
> > *Accumulators*
> > Values are only counted for successful tasks. Is that right? KMeans seems
> > to use it in this way. What happens if a node goes away and successful
> > tasks need to be resubmitted? Or the stage runs again because a different
> > job needed it?
>
> Accumulators are guaranteed to give a deterministic result if you only
> increment them in actions. For each result stage, the accumulator's update
> from each task is only applied once, even if that task runs multiple times.
> If you use accumulators in transformations (i.e. in a stage that may be part
> of multiple jobs), then you may see multiple updates, one from each run. This
> is kind of confusing, but it was useful for people who wanted to use these
> for debugging.
>
> Matei
>
> > thanks,
> > Sandy
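
To make the "applied only once" semantics discussed above concrete, here is a toy model in plain Python. It is not Spark's scheduler code and says nothing about how (or whether) the current implementation does this; the names ToyAccumulator, ToyDriver, task_succeeded and the dedupe flag are all hypothetical, invented for illustration. It only shows the difference between merging a task's update once per (stage, partition) and merging it on every run of that task.

```python
# Toy model only: NOT Spark's scheduler code. All names are hypothetical.


class ToyAccumulator:
    """Holds a running integer sum of the updates merged by the driver."""

    def __init__(self):
        self.value = 0


class ToyDriver:
    """Merges per-task updates, optionally at most once per (stage, partition)."""

    def __init__(self, accumulator):
        self.accumulator = accumulator
        self._applied = set()  # (stage_id, partition_id) pairs already merged

    def task_succeeded(self, stage_id, partition_id, update, dedupe):
        key = (stage_id, partition_id)
        if dedupe and key in self._applied:
            return  # rerun of a task whose update was already counted
        self._applied.add(key)
        self.accumulator.value += update


def run(dedupe):
    acc = ToyAccumulator()
    driver = ToyDriver(acc)
    # Partitions 0, 1, 2 each report an update of 1. Partition 1 then runs a
    # second time, e.g. because a node was lost and its task was resubmitted.
    for partition_id in (0, 1, 2, 1):
        driver.task_succeeded(stage_id=0, partition_id=partition_id,
                              update=1, dedupe=dedupe)
    return acc.value


once = run(dedupe=True)       # the rerun's update is ignored
repeated = run(dedupe=False)  # the rerun's update is counted again
print(once, repeated)
```

With dedupe=True the model behaves like the result-stage case described in the quoted reply (the rerun does not change the total); with dedupe=False it behaves like the transformation case, where each run of a task contributes its own update.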