Maybe the better approach is to understand why your job isn't scaling -
what does the UI show? Are the resources actually the same? For
example, do you have more than 8 cores in the local setup?
Is there enough parallelism? For example, it doesn't look like the
small input is repartitioned to at least
Thanks - that's great, I'll check out both spark-bench and SparkPi.
I do have more than 8 cores in the local setup: 24 cores in total (12 per
machine).
However, on AWS with the same cluster setup, that is not the case; I chose
Medium size instances hoping that a much smaller instance size would s
I mean that local[*] = all cores on the machines, whereas in your
example you seem to be choosing 8 cores per executor in the
distributed case. You'd have 12 cores in your local case - which is
still less than 2x8, but just the kind of thing to consider when
comparing these setups.
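
To make that comparison concrete, here is a rough sketch in plain Python (no Spark needed) of how many tasks each setup could run at once. The helper names are hypothetical, the executor count of 2 is only my reading of the "2x8" above, and the partition count of 8 is a made-up example value, not something taken from your job. It also assumes spark.task.cpus is left at its default of 1.

```python
import math

def total_task_slots(executors, cores_per_executor, cpus_per_task=1):
    """Tasks that can run concurrently (hypothetical helper, not a Spark API)."""
    return executors * (cores_per_executor // cpus_per_task)

def waves(partitions, slots):
    """Rounds of tasks needed to process every partition of a stage."""
    return math.ceil(partitions / slots)

def idle_slots(partitions, slots):
    """Slots with nothing to do when a stage has fewer partitions than slots."""
    return max(0, slots - partitions)

# local[*] on one 12-core machine: a single JVM using all 12 cores.
local_slots = total_task_slots(executors=1, cores_per_executor=12)

# Distributed case, assuming 2 executors with 8 cores each ("2x8").
cluster_slots = total_task_slots(executors=2, cores_per_executor=8)

# Made-up example: an input that ends up in only 8 partitions.
example_partitions = 8

print("local slots:", local_slots)
print("cluster slots:", cluster_slots)
print("idle cluster slots with 8 partitions:",
      idle_slots(example_partitions, cluster_slots))
```

If the number of partitions is below the number of slots, the extra cores just sit idle, which is why the UI and the partition count are worth checking before comparing wall-clock times.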
Indeed, how wel