I have also tried the configuration calculator sheet provided by Cloudera,
but saw no improvement. However, let's ignore the 17 million record
operation to begin with.
Let's consider the simple sort on YARN and Spark, which shows a tremendous
difference.
The operation is simple: a selected numeric column to be sorted ascendi
What does your Spark job do? Have you tried the standard configuration and
then changing it gradually?
Have you checked the log files/UI to see which tasks take long?
17 million records does not sound like much, but it depends on what you do
with them.
I do not think that for such a small "cluster" it makes sense to hav