Hi, I am a complete newbie to Spark and MapReduce frameworks, and I have a basic question about the reduce function. I was working on the word-count example and got stuck at the reduce stage, where the summing happens.
I am trying to understand how `reduceByKey` works in Spark, using Java as the programming language.

Say I have the sentence "I am who I am". I break the sentence into words and store them as a list: [I, am, who, I, am]. The function below then pairs each word with a 1:

```java
JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String s) {
        // Emit a (word, 1) pair for every word
        return new Tuple2<String, Integer>(s, 1);
    }
});
```

so the output is something like this:

(I,1) (am,1) (who,1) (I,1) (am,1)

Now, if I have 3 reducers running, each reducer will get a key and the values associated with that key:

reducer 1 : (I,1) (I,1)
reducer 2 : (am,1) (am,1)
reducer 3 : (who,1)

I wanted to know:

a. What exactly happens in the code below?
b. What are the type parameters of `new Function2`?
c. Basically, how is the resulting JavaPairRDD formed?

```java
JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer i1, Integer i2) {
        // Combine two counts for the same key into one
        return i1 + i2;
    }
});
```

Thanks!
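In case it helps, here is the minimal self-contained program I am experimenting with. The scaffolding around the two snippets above (the `WordCount` class name, the `local[*]` master, and the `flatMap` split step) is just my own setup, and I am assuming the Spark 1.x Java API, where `FlatMapFunction.call` returns an `Iterable`:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // One input line: "I am who I am"
        JavaRDD<String> lines = sc.parallelize(Arrays.asList("I am who I am"));

        // Split each line into words: [I, am, who, I, am]
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterable<String> call(String line) {
                return Arrays.asList(line.split(" "));
            }
        });

        // Pair each word with a 1: (I,1) (am,1) (who,1) (I,1) (am,1)
        JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<String, Integer>(s, 1);
            }
        });

        // Sum the 1s per key: (I,2) (am,2) (who,1)
        JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer i1, Integer i2) {
                return i1 + i2;
            }
        });

        List<Tuple2<String, Integer>> output = counts.collect();
        for (Tuple2<String, Integer> tuple : output) {
            System.out.println(tuple._1() + ": " + tuple._2());
        }

        sc.stop();
    }
}
```

Running this locally prints I: 2, am: 2, who: 1 (the order of the lines may vary).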