On 12 Dec 2016, at 19:57, Daniel Siegmann <dsiegm...@securityscorecard.io> wrote:
> Accumulators are generally unreliable and should not be used. The answer to (2) and (4) is yes. The answer to (3) is both.
>
> Here's a more in-depth explanation: http://imranrashid.com/posts/Spark-Accumulators/

That's a really nice article.

Accumulators work for generating statistics of things, such as bytes read/written (Spark uses them internally for this), network errors ignored, etc. Those stay correct on retries: if you read more bytes, the byte counter should increase.

Where they are dangerous is when they are treated as a real output of the work, an answer to some query, albeit just a side effect. People have been doing that with MR counters since Hadoop 0.1x, so there's no need to feel bad about trying; everyone tries at some point. In Hadoop 1.x, creating too many counters could actually overload the entire job tracker; at some point a per-job limit went in for that reason, and it's still in the MR code to keep costs down. Spark's accumulators only use up your cluster's storage plus extra data on the heartbeats, but because of retries they are less an accumulator of results and more an accumulator of "things that happened during one or more executions of a function against an RDD".

You should really be generating all output as the output of a series of functional operations on RDDs.
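
To make the distinction concrete, here is a minimal sketch (not from the original thread) assuming the Spark 2.x SparkSession and longAccumulator API. The data and names (records, badCount) are purely illustrative. It contrasts an accumulator updated inside a transformation, which can overcount if tasks or stages are re-executed, with a count produced as the result of RDD operations, where retries never change the answer.

    import org.apache.spark.sql.SparkSession

    object AccumulatorSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("accumulator-vs-rdd-output")
          .master("local[*]")
          .getOrCreate()
        val sc = spark.sparkContext

        // Illustrative data: a handful of records, two of which are "bad".
        val records = sc.parallelize(Seq("ok", "bad", "ok", "bad", "ok"))

        // Side-effect statistic via an accumulator, updated inside a
        // transformation. If a task or stage is re-executed, the same
        // records add again, so the value may drift above the true count.
        val badCount = sc.longAccumulator("bad-records")
        val passedThrough = records.map { r =>
          if (r == "bad") badCount.add(1L)
          r
        }
        passedThrough.count() // force the transformation to run
        println(s"accumulator (statistic, may overcount on retries): ${badCount.value}")

        // Real output as the result of functional operations on the RDD:
        // retries are invisible here; the answer is always exactly 2.
        val exactBadCount = records.filter(_ == "bad").count()
        println(s"RDD result (exact): $exactBadCount")

        spark.stop()
      }
    }

Note that Spark only guarantees exactly-once accumulator updates for updates performed inside actions; inside transformations, recomputed or speculative tasks may apply the same update more than once, which is exactly the "one or more executions of a function against an RDD" caveat above.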