>
> Are you saying that sorting the entire data and collecting it on the
> driver node is not a typical use case?
It most definitely is not. Spark is designed and intended to be used with
very large datasets. Far from being typical, collecting hundreds of
gigabytes, terabytes or petabytes to the driver node is the exception, not
the rule.
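
To make the distinction concrete, here is a minimal Scala sketch of the
two patterns under discussion (the paths and app name are illustrative,
not from the thread): the sort itself stays distributed either way, but
collect() funnels every row into the driver's heap, while
saveAsTextFile() writes the sorted partitions out in parallel.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("sort-demo"))
    val lines = sc.textFile("hdfs:///data/huge-input")  // hypothetical path

    // Distributed sort: range-partitioned, each executor sorts its slice.
    val sorted = lines.sortBy(identity)

    // Anti-pattern for large data: pulls the ENTIRE sorted dataset into
    // the driver's memory. Works only while it fits on one machine.
    // val local: Array[String] = sorted.collect()

    // Scalable alternative: the part files come out globally ordered
    // (everything in part-00000 sorts before part-00001, and so on).
    sorted.saveAsTextFile("hdfs:///data/huge-output-sorted")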
Thank you for your responses!
You mention that it only works as long as the data fits on a single
machine. What I am trying to do is retrieve the sorted contents of my
dataset. For this to be possible, the entire dataset should be able to fit
on a single machine. Are you saying that sorting the entire data and
collecting it on the driver node is not a typical use case?
Correct. Trading away scalability for increased performance is not an
option in the standard Spark API.
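
One scalable middle ground, not spelled out in the thread itself: if only
a bounded prefix of the sorted data is needed on the driver, takeOrdered()
fetches just those N elements instead of the whole dataset. A small sketch
(assuming an existing SparkContext sc; the numbers are made up):

    val nums = sc.parallelize(Seq(5, 1, 9, 3, 7, 2, 8, 4, 6))

    // Only the 3 smallest elements ever reach the driver, however large
    // the RDD is: driver memory cost is O(3), not O(dataset).
    val smallest = nums.takeOrdered(3)  // Array(1, 2, 3)

    // Largest 3, using the reverse ordering.
    val largest = nums.top(3)           // Array(9, 8, 7)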
On Tue, Jun 9, 2015 at 3:05 AM, Daniel Darabos <
daniel.dara...@lynxanalytics.com> wrote:
> It would be even faster to load the data on the driver and sort it there
> without using Spark :).
It would be even faster to load the data on the driver and sort it there
without using Spark :). Using reduce() is cheating, because it only works
as long as the data fits on one machine. That is not the targeted use case
of a distributed computation system. You can repeat your test with more
data than fits on one machine and see how the two approaches compare.
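
The benchmark code never appears in the thread, so the reduce()-based
"sort" being criticized is presumably something like this reconstruction
(a sketch, not the poster's actual code): each partition is sorted
locally, then reduce() merges the sorted lists pairwise, so the final
merged list, i.e. the whole dataset, has to fit in a single JVM.

    // Merge two sorted lists; iterative to avoid deep recursion.
    def mergeSorted(a: List[Int], b: List[Int]): List[Int] = {
      val out = scala.collection.mutable.ListBuffer.empty[Int]
      var (xs, ys) = (a, b)
      while (xs.nonEmpty && ys.nonEmpty) {
        if (xs.head <= ys.head) { out += xs.head; xs = xs.tail }
        else { out += ys.head; ys = ys.tail }
      }
      out ++= xs
      out ++= ys
      out.toList
    }

    val data = sc.parallelize(1 to 100000).map(_ => scala.util.Random.nextInt())

    // Sorting each partition locally is fine: memory use is bounded by
    // the partition size.
    val perPartition = data.mapPartitions(it => Iterator(it.toList.sorted))

    // The "cheating" step: reduce() keeps merging until one List holds
    // EVERY element, so the result must fit on a single machine.
    val allSorted: List[Int] = perPartition.reduce(mergeSorted)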