An RDD is a Resilient *Distributed* Dataset. The partitioning and distribution of the data happen in the background. You'll occasionally need to concern yourself with them (especially to get good performance), but from an API perspective they're mostly invisible (though some methods do let you specify a number of partitions).
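For example, here is a minimal sketch of the few places where partitioning actually surfaces in the Java API (the file path, partition counts, and class name below are made up for illustration, not taken from your setup):

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;

  public class PartitionDemo {
    public static void main(String[] args) {
      // local[4] just so the sketch runs on a laptop; normally spark-submit supplies the master.
      SparkConf conf = new SparkConf().setAppName("partition-demo").setMaster("local[4]");
      JavaSparkContext sc = new JavaSparkContext(conf);

      // Ask for at least 8 input partitions; otherwise the split count
      // follows how the data is stored (e.g. one partition per HDFS block).
      JavaRDD<String> lines = sc.textFile("hdfs:///tmp/input.txt", 8);
      System.out.println("partitions: " + lines.partitions().size());

      // repartition() shuffles into the requested number of partitions;
      // coalesce() can shrink the count without a full shuffle.
      JavaRDD<String> fewer = lines.repartition(4);
      System.out.println("after repartition: " + fewer.partitions().size());

      sc.stop();
    }
  }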
When you call sc.textFile(myPath) or similar, you get an RDD. That RDD will be composed of a bunch of partitions, but you don't really need to worry about that. The partitioning will be based on how the data is stored. When you call a method that causes a shuffle (such as reduceByKey), the data is repartitioned into a number of partitions based on your default parallelism setting (which IIRC is based on your number of cores if you haven't set it explicitly).

When you call reduce and similar methods, each partition can be reduced in parallel, and then the per-partition results are transferred across the network and reduced to the final result. *You supply the function and Spark handles the parallel execution of that function.* (There's a minimal sketch of this pattern after the quoted message below.)

I hope this helps clear up your misconceptions. You might also want to familiarize yourself with the collections API in Java 8 (or Scala, or Python, or pretty much any other language with lambda expressions), since RDDs are meant to have an API that feels similar.

On Thu, Mar 5, 2015 at 9:45 AM, raggy <raghav0110...@gmail.com> wrote:
> I am trying to use Apache Spark to load up a file, distribute the file to
> several nodes in my cluster, and then aggregate the results and obtain them.
> I don't quite understand how to do this.
>
> From my understanding, the reduce action enables Spark to combine the
> results from different nodes and aggregate them together. Am I
> understanding this correctly?
>
> From a programming perspective, I don't understand how I would code this
> reduce function.
>
> How exactly do I partition the main dataset into N pieces and ask for them
> to be processed in parallel by using a list of transformations?
>
> reduce is supposed to take in two elements and a function for combining
> them. Are these two elements supposed to be RDDs from the context of Spark,
> or can they be any type of element? Also, if you have N different
> partitions running in parallel, how would reduce aggregate all their
> results into one final result (since the reduce function aggregates only
> two elements)?
>
> Also, I don't understand this example. The example from the Spark website
> uses reduce, but I don't see the data being processed in parallel. So, what
> is the point of the reduce? If I could get a detailed explanation of the
> loop in this example, I think that would clear up most of my questions.
>
> class ComputeGradient extends Function<DataPoint, Vector> {
>   private Vector w;
>   ComputeGradient(Vector w) { this.w = w; }
>   public Vector call(DataPoint p) {
>     return p.x.times(p.y * (1 / (1 + Math.exp(w.dot(p.x))) - 1));
>   }
> }
>
> JavaRDD<DataPoint> points = spark.textFile(...).map(new ParsePoint()).cache();
> Vector w = Vector.random(D); // current separating plane
> for (int i = 0; i < ITERATIONS; i++) {
>   Vector gradient = points.map(new ComputeGradient(w)).reduce(new AddVectors());
>   w = w.subtract(gradient);
> }
> System.out.println("Final separating plane: " + w);
>
> Also, I have been trying to find the source code for reduce on the Apache
> Spark GitHub, but the source is pretty large and I haven't been able to
> pinpoint it. Could someone please direct me towards which file I could
> find it in?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Partitioning-Dataset-and-Using-Reduce-in-Apache-Spark-tp21933.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
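To make the reduce part of that concrete, here is a minimal sketch using the Java API (the numbers, class name, and partition count are made up for illustration). Note that the two arguments to the reduce function are elements of the RDD, not RDDs:

  import java.util.Arrays;
  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;
  import org.apache.spark.api.java.function.Function2;

  public class ReduceDemo {
    public static void main(String[] args) {
      JavaSparkContext sc = new JavaSparkContext(
          new SparkConf().setAppName("reduce-demo").setMaster("local[4]"));

      // Distribute a small local collection across 4 partitions.
      JavaRDD<Integer> numbers =
          sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 4);

      // reduce() takes a function of two elements (here, two Integers).
      // Spark applies it within each partition in parallel, then combines
      // the per-partition results into one final value on the driver.
      Integer sum = numbers.reduce(new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer a, Integer b) {
          return a + b;
        }
      });

      System.out.println("sum = " + sum); // prints 36
      sc.stop();
    }
  }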