Re: Different Sorting RDD methods in Apache Spark

2015-06-09 Thread Mark Hamstra
> Are you saying that sorting the entire data and collecting it on the driver node is not a typical use case?
It most definitely is not. Spark is designed and intended to be used with very large datasets. Far from being typical, collecting hundreds of gigabytes, terabytes or petabytes to th

Re: Different Sorting RDD methods in Apache Spark

2015-06-09 Thread Raghav Shankar
Thank you for your responses! You mention that it only works as long as the data fits on a single machine. What I am trying to do is receive the sorted contents of my dataset. For this to be possible, the entire dataset should be able to fit on a single machine. Are you saying that sorting the entir

Re: Different Sorting RDD methods in Apache Spark

2015-06-09 Thread Mark Hamstra
Correct. Trading away scalability for increased performance is not an option for the standard Spark API.
On Tue, Jun 9, 2015 at 3:05 AM, Daniel Darabos <daniel.dara...@lynxanalytics.com> wrote:
> It would be even faster to load the data on the driver and sort it there without using Spark :).

Re: Different Sorting RDD methods in Apache Spark

2015-06-09 Thread Daniel Darabos
It would be even faster to load the data on the driver and sort it there without using Spark :). Using reduce() is cheating, because it only works as long as the data fits on one machine. That is not the targeted use case of a distributed computation system. You can repeat your test with more data
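
To make the contrast in this thread concrete, here is a rough plain-Python simulation of the two approaches. It is only a sketch, not Spark's API: the "partitions" are ordinary lists, and the function names (sort_via_reduce, sort_via_range_partitioning) and the per_partition_sample parameter are made up for illustration. The first function mirrors the reduce()-style merge, which ends with the whole dataset in a single list. The second mirrors the general idea behind a range-partitioned sort: sample the data to pick split points, route each record to its range, and sort each range independently, so no single list ever has to hold everything.

```python
import bisect
import heapq
import random
from functools import reduce


def sort_via_reduce(partitions):
    """Sort each partition, then merge everything into ONE list.

    Hypothetical helper: every merge step builds a list holding all data
    seen so far, so the final result must fit on a single machine.
    """
    sorted_parts = [sorted(p) for p in partitions]
    return reduce(lambda a, b: list(heapq.merge(a, b)), sorted_parts, [])


def sort_via_range_partitioning(partitions, num_out, per_partition_sample=20, seed=0):
    """Sample split points, route records by range, sort each range locally.

    Hypothetical helper: returns num_out sorted lists whose concatenation,
    in order, is the fully sorted dataset.
    """
    rng = random.Random(seed)
    # Take a small sample from every input partition to estimate the key ranges.
    sample = sorted(
        x
        for p in partitions
        for x in rng.sample(p, min(per_partition_sample, len(p)))
    )
    if not sample:
        return [[] for _ in range(num_out)]
    # Choose num_out - 1 boundaries, evenly spaced through the sorted sample.
    bounds = [sample[(i * len(sample)) // num_out] for i in range(1, num_out)]
    out = [[] for _ in range(num_out)]
    for p in partitions:
        for x in p:
            # bisect_right counts the boundaries <= x, which is the target range.
            out[bisect.bisect_right(bounds, x)].append(x)
    # Each range is sorted on its own; no step needs all of the data at once.
    return [sorted(p) for p in out]


if __name__ == "__main__":
    data = [[5, 3, 9], [1, 8], [7, 2, 6, 4]]
    print(sort_via_reduce(data))
    print(sort_via_range_partitioning(data, num_out=3))
```

In this toy version the reduce variant does less bookkeeping, which matches the observation that it can look faster on small inputs, but only the range-partitioned variant keeps each piece of work bounded as the data grows.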