Re: Do developers have to be aware of Spark's fault tolerance mechanism?

Marcelo Vanzin Fri, 18 Apr 2014 17:55:27 -0700

Hi Sung,

On Fri, Apr 18, 2014 at 5:11 PM, Sung Hwan Chung
<coded...@cs.stanford.edu> wrote:
> while (true) {
>   rdd.map((row : Array[Double]) => {
>     row[numCols - 1] = computeSomething(row)
>   }).reduce(...)
> }
>
> If it fails at some point, I'd imagine that the intermediate info being
> stored in row[numCols - 1] will be lost. And unless Spark runs this whole
> thing from the very first iteration, things will get out of sync.


I'm not sure I completely follow what you're trying to achieve here.
But modifications to the "row" argument to the closure won't be seen
anywhere outside that closure. What the reduce() step will see will be
the output of the map step; in your code above, you're mapping the
input to nothing.

If you did this instead:

   rdd.map((row : Array[Double]) => {
     row(numCols - 1) = computeSomething(row)
     row
   }).reduce(...)

Then you'd be mapping the input to an array which just happens to be
the same as the input of the map function, with a modified item at
"numCols - 1". Note there isn't any state being kept in "row"; the
only "state" here is the output of the map function, which might be
serialized and sent somewhere else to be processed by the closure
passed to reduce().

If there's an error while processing the map function, Spark will
re-run it on a different worker node, and you should get the same
output (given that the input of your RDD is the same).

Also note that your example is always running exactly the same
computation, since it's always using the same RDD. If you want the
loop to do some sort of chaining, you'd need to create a new RDD from
the results of the reduce, and then perform your map/reduce operations
on that new RDD. And then the same rules as above would apply.

Just to illustrate what I'm trying to say, try running this in a spark shell:

var input = List(Array(1,2,3), Array(4,5,6))
var rows = sc.parallelize(input)

rows.map(row => {
  row(1) = row(1) * 2
  row
}).reduce((a1, a2) => Array(a1(0) + a2(0), a1(1) + a2(1), a1(2) + a2(2)))


And note how, after the computation runs, "input" still holds its
original value, even though the map function modified its input.


-- 
Marcelo

Re: Do developers have to be aware of Spark's fault tolerance mechanism?

Reply via email to