I am trying to use Apache Spark to load a file, distribute it to several nodes in my cluster for processing, and then aggregate and collect the results. I don't quite understand how to do this.
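Concretely, this is the shape of the program I think I need to write (a sketch of my own; the file path, the partition count, and the idea of summing line lengths are just placeholders I made up, and I'm not sure this is the right approach):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.api.java.function.Function2;

    public class AggregateFile {
      public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("AggregateFile"));

        // Load the file; the second argument requests a minimum number of
        // partitions, which (as I understand it) is how the data gets
        // spread across the nodes.
        JavaRDD<String> lines = sc.textFile("hdfs://.../input.txt", 8);

        // Transform each line independently (this part should run in
        // parallel, one task per partition)...
        JavaRDD<Integer> lengths = lines.map(new Function<String, Integer>() {
          public Integer call(String line) {
            return line.length();
          }
        });

        // ...then aggregate everything back into one value on the driver.
        Integer total = lengths.reduce(new Function2<Integer, Integer, Integer>() {
          public Integer call(Integer a, Integer b) {
            return a + b;
          }
        });

        System.out.println("Total characters: " + total);
        sc.stop();
      }
    }

Is that roughly the right structure?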
From my understanding, the reduce action enables Spark to combine the results from different nodes and aggregate them together. Am I understanding this correctly?

From a programming perspective, I don't understand how I would code this reduce function. How exactly do I partition the main dataset into N pieces and ask for them to be processed in parallel by a list of transformations? reduce is supposed to take in two elements and a function for combining them. Are these two elements supposed to be RDDs in the context of Spark, or can they be elements of any type? Also, if you have N different partitions running in parallel, how would reduce aggregate all of their results into one final result, given that the reduce function only combines two elements at a time?

I also don't understand the example below. It comes from the Spark website and uses reduce, but I don't see the data being processed in parallel, so what is the point of the reduce? If I could get a detailed explanation of the loop in this example, I think that would clear up most of my questions.

    class ComputeGradient extends Function<DataPoint, Vector> {
      private Vector w;
      ComputeGradient(Vector w) { this.w = w; }
      public Vector call(DataPoint p) {
        // gradient contribution of a single data point
        return p.x.times(p.y * (1 / (1 + Math.exp(w.dot(p.x))) - 1));
      }
    }

    JavaRDD<DataPoint> points = spark.textFile(...).map(new ParsePoint()).cache();
    Vector w = Vector.random(D); // current separating plane
    for (int i = 0; i < ITERATIONS; i++) {
      // map computes a per-point gradient; reduce sums them into one Vector
      Vector gradient = points.map(new ComputeGradient(w)).reduce(new AddVectors());
      w = w.subtract(gradient);
    }
    System.out.println("Final separating plane: " + w);

Also, I have been trying to find the source code for reduce on the Apache Spark GitHub, but the codebase is pretty large and I haven't been able to pinpoint it. Could someone please point me to the file it is defined in?
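Finally, to test my understanding of how reduce combines partial results from several partitions, I wrote the toy program below (entirely my own sketch; the assumption I'm trying to verify is that reduce applies the two-argument function pairwise, first within each partition and then across partitions):

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function2;

    public class ReduceTest {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ReduceTest").setMaster("local[4]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Explicitly ask for 4 partitions so the work is split four ways.
        JavaRDD<Integer> nums =
            sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 4);

        // The combining function sees two *elements* (plain Integers),
        // never whole RDDs.
        Integer sum = nums.reduce(new Function2<Integer, Integer, Integer>() {
          public Integer call(Integer a, Integer b) {
            return a + b;
          }
        });

        System.out.println("sum = " + sum); // prints 36
        sc.stop();
      }
    }

Is it correct that each partition is first reduced locally to a partial sum, and the four partial sums are then combined into the final result on the driver?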