On 12 Dec 2016, at 19:57, Daniel Siegmann <dsiegm...@securityscorecard.io> wrote:

> Accumulators are generally unreliable and should not be used. The answer to (2)
> and (4) is yes. The answer to (3) is both.

> Here's a more in-depth explanation:
> http://imranrashid.com/posts/Spark-Accumulators/


That's a really nice article.

Accumulators work well for gathering statistics about things such as bytes
read/written (Spark uses them internally for this), network errors ignored, etc.
Those figures stay meaningful across retries: if a retried task reads more bytes,
the byte counter should go up, because more bytes really were read.
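
As a minimal sketch of that kind of use (the input path, the field count and the
accumulator name below are hypothetical, not from this thread), the accumulator
is only a job statistic, in the spirit of a bytes-read counter:

  import org.apache.spark.sql.SparkSession

  object SkippedLineStats {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("skipped-line-stats").getOrCreate()
      val sc = spark.sparkContext

      // Accumulator used purely as a monitoring statistic, not as an answer.
      val skipped = sc.longAccumulator("malformed-lines")

      val records = sc.textFile("hdfs:///data/input")      // hypothetical path
        .flatMap { line =>
          val fields = line.split(",")
          if (fields.length == 3) Some(fields)
          else { skipped.add(1); None }                     // side-effect metric only
        }

      println(s"parsed ${records.count()} records, skipped ${skipped.value} malformed lines")
      spark.stop()
    }
  }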

Where they are dangerous is when they are treated as a real output of the work, an
answer to some query, albeit just a side effect. People have been doing that
with MR counters since Hadoop 0.1x, so there's no need to feel bad about
trying; everyone tries at one point. In Hadoop 1.x, trying to create too many
counters would actually overload the entire job tracker; at some point a
per-job limit went in for that reason, and it's still in the MR code to keep costs
down.

Spark's accumulators only use up some of your cluster's storage plus extra data on
the heartbeats, but because of retries they are less an accumulator of results
and more an accumulator of 'things that happened during one or more executions
of a function against an RDD'.
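
A small local sketch of what that means in practice (the local[2] master and the
1-to-100 range are just illustrative assumptions): an accumulator updated inside a
transformation counts executions of the function, so running two actions over the
same uncached RDD doubles it.

  import org.apache.spark.sql.SparkSession

  object AccumulatorRecount {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("accumulator-recount")
        .master("local[2]")
        .getOrCreate()
      val sc = spark.sparkContext

      val seen = sc.longAccumulator("records-seen")
      val data = sc.parallelize(1 to 100).map { x => seen.add(1); x }

      data.count()                    // first execution: seen.value is 100
      data.reduce(_ + _)              // second action re-runs the map on the uncached RDD
      println(s"records seen: ${seen.value}")   // prints 200, not 100
      spark.stop()
    }
  }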

You should really be generating all real output as the result of a series of
functional operations on RDDs.
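
A hedged sketch of that alternative, reusing the same hypothetical input as above:
the malformed-line count becomes part of the job's output, computed through the
same fault-tolerant path as any other result, so retries cannot inflate it.

  import org.apache.spark.sql.SparkSession

  object FunctionalLineStats {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("functional-line-stats").getOrCreate()
      val sc = spark.sparkContext

      // Classify every line, then count each class in one pass.
      val counts: Map[String, Long] = sc.textFile("hdfs:///data/input")   // hypothetical path
        .map(line => if (line.split(",").length == 3) ("good", 1L) else ("malformed", 1L))
        .reduceByKey(_ + _)
        .collect()
        .toMap

      println(s"good=${counts.getOrElse("good", 0L)} malformed=${counts.getOrElse("malformed", 0L)}")
      spark.stop()
    }
  }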
