Re: Partitioning Dataset and Using Reduce in Apache Spark

2015-03-13 Thread Daniel Siegmann
On Thu, Mar 12, 2015 at 1:45 AM, wrote:
> In your response you say “When you call reduce and *similar* methods, each partition can be reduced in parallel. Then the results of that can be transferred across the network and reduced to the final result”. By similar methods do you mean all ac

Re: Partitioning Dataset and Using Reduce in Apache Spark

2015-03-11 Thread raghav0110.cs
Thank you very much for your detailed response; it was very informative and cleared up some of my misconceptions. After your explanation, I understand that the distribution of the data and the parallelism are meant to be abstracted away from the developer. In your response you say “When you ca

Re: Partitioning Dataset and Using Reduce in Apache Spark

2015-03-05 Thread Daniel Siegmann
An RDD is a Resilient *Distributed* Dataset. The partitioning and distribution of the data happens in the background. You'll occasionally need to concern yourself with it (especially to get good performance), but from an API perspective it's mostly invisible (some methods do allow you to specify a
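
The two-stage reduce quoted in this thread (each partition reduced in parallel, then the partial results combined into the final result) can be sketched in plain Python. This is only an illustration of the idea, not Spark's implementation and not the Spark API: `split_into_partitions` and `partitioned_reduce` are hypothetical helper names, the partitions are ordinary lists in one process, and stage 1 runs sequentially here where Spark would run it on separate executors.

```python
from functools import reduce
from operator import add


def split_into_partitions(data, num_partitions):
    """Hypothetical helper: cut a list into contiguous, near-equal chunks."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions each take one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def partitioned_reduce(data, func, num_partitions):
    """Sketch of a two-stage reduce; raises TypeError on empty input."""
    partitions = split_into_partitions(data, num_partitions)
    # Stage 1: reduce every non-empty partition on its own.
    # (Assumption: in a real cluster these reductions would run in parallel.)
    partials = [reduce(func, part) for part in partitions if part]
    # Stage 2: combine the per-partition results into the final value.
    # (Assumption: this is where partial results would cross the network.)
    return reduce(func, partials)


total = partitioned_reduce(list(range(1, 11)), add, num_partitions=3)
```

Because the per-partition results are combined in a second pass, the function passed in should be associative and commutative (addition and `max` qualify; subtraction does not), otherwise the answer can depend on how the data happens to be partitioned.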