Hi,
I am a complete newbie to Spark and MapReduce frameworks, and I have a basic
question about the reduce function. I was working through the word-count
example and got stuck at the reduce stage, where the summing happens.
I am trying to understand how reduceByKey works in Spark, using Java as the
programming language.
Say I have the sentence "I am who I am". I break the sentence into words
and store them as the list [I, am, who, I, am].
The following function then assigns a 1 to each word:
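Just to make sure I have the first step right, here is a minimal plain-Java sketch of the split (my own standalone code, not the actual Spark flatMap step):

```java
import java.util.Arrays;
import java.util.List;

public class SplitSketch {
    public static void main(String[] args) {
        // Split the sentence on spaces into individual words.
        String sentence = "I am who I am";
        List<String> words = Arrays.asList(sentence.split(" "));
        System.out.println(words); // prints [I, am, who, I, am]
    }
}
```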
JavaPairRDD<String, Integer> ones = words.mapToPair(
    new PairFunction<String, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(String s) {
            return new Tuple2<String, Integer>(s, 1);
        }
    });
So the output is something like this:
(I,1) (am,1) (who,1) (I,1) (am,1)
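My mental model of this pairing step, as a plain-Java sketch using Map.Entry in place of Spark's Tuple2 (my own simulation, not Spark code):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class OnesSketch {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("I", "am", "who", "I", "am");

        // Pair each word with the count 1, keeping duplicates,
        // the way mapToPair produces one (word, 1) tuple per input word.
        List<Map.Entry<String, Integer>> ones = new ArrayList<>();
        for (String w : words) {
            ones.add(new SimpleEntry<>(w, 1));
        }
        System.out.println(ones); // prints [I=1, am=1, who=1, I=1, am=1]
    }
}
```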
Now, if I have 3 reducers running, each reducer will get a key and the
values associated with that key:
reducer 1 : (I,1) (I,1)
reducer 2 : (am,1) (am,1)
reducer 3 : (who,1)
I wanted to know:
a. What exactly happens in the code below?
b. What are the parameters of new Function2?
c. Basically, how is the resulting JavaPairRDD formed?
JavaPairRDD<String, Integer> counts = ones.reduceByKey(
    new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer i1, Integer i2) {
            return i1 + i2;
        }
    });
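From what I have read so far, my guess at what this reduce does is something like the following plain-Java sketch, where values for the same key are combined pairwise with (i1, i2) -> i1 + i2 (my own HashMap simulation, not actual Spark internals):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReduceByKeySketch {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("I", "am", "who", "I", "am");

        // Start each key at 1; when the key is seen again, merge the
        // existing value with the new one using the sum function,
        // the way I imagine reduceByKey combines values per key.
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1, (i1, i2) -> i1 + i2);
        }
        System.out.println(counts.get("I"));   // prints 2
        System.out.println(counts.get("am"));  // prints 2
        System.out.println(counts.get("who")); // prints 1
    }
}
```

Is that roughly the right picture, or does Spark combine values differently across partitions?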
Thanks!