Hey All,

A couple of questions came up about shared variables recently, and I wanted to confirm my understanding and update the doc to be a little clearer.

*Broadcast variables*

Now that task data is automatically broadcast, the only occasions where it makes sense to explicitly broadcast are:

* You want to use a variable from tasks in multiple stages.
* You want to have the variable stored on the executors in deserialized form.
* You want tasks to be able to modify the variable and have those modifications take effect for other tasks running on the same executor (usually a very bad idea).

Is that right?

*Accumulators*

Values are only counted for successful tasks. Is that right? KMeans seems to use them this way. What happens if a node goes away and successful tasks need to be resubmitted? Or if the stage runs again because a different job needed it?

thanks,
Sandy
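
P.S. To make the accumulator questions concrete, here is a toy model in plain Python. It is not Spark code and says nothing about how Spark is implemented; `ToyAccumulator`, `run_stage`, and `failing_attempts` are hypothetical names made up for illustration. The only assumption is that updates are merged from successful task attempts and dropped for failed ones. Under that assumption alone, a retried task is counted once, but a stage that runs a second time counts everything again.

```python
class ToyAccumulator:
    """Toy driver-side counter (hypothetical, not a Spark class)."""

    def __init__(self):
        self.value = 0

    def merge(self, update, succeeded):
        # Assumption: updates from failed attempts are dropped.
        if succeeded:
            self.value += update


def run_stage(acc, partitions, failing_attempts=frozenset()):
    """Run one task per partition, retrying any attempt listed as failing."""
    for pid, records in enumerate(partitions):
        attempt = 0
        while True:
            local_count = len(records)  # each task counts its own records
            succeeded = (pid, attempt) not in failing_attempts
            acc.merge(local_count, succeeded)
            if succeeded:
                break
            attempt += 1  # retry the same partition


partitions = [[1, 2, 3], [4, 5], [6]]
acc = ToyAccumulator()

# First run: the first attempt on partition 1 fails and is retried.
run_stage(acc, partitions, failing_attempts={(1, 0)})
first_run = acc.value

# Second run: the same stage executes again, e.g. because another job needs it.
run_stage(acc, partitions)
after_rerun = acc.value
```

In this model `first_run` equals the number of records, while `after_rerun` is twice that, since nothing deduplicates updates across runs of the stage. Whether Spark behaves like the second case is exactly the open question.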