I’ll add a new option to escape arbitrary Spark options and put them directly into the SparkConf for the job before the context is created.
The CLI will be something like -D xxx=yyy, so for this case you can change the default parallelism with -D spark.default.parallelism=400. If the rule of thumb holds that parallelism should often be 8 to 16x your number of cores, then running locally on my laptop with local[7] I should have -D spark.default.parallelism=112 or 56. If you want this value set for your entire cluster, you should be able to set it in the conf files when you launch the cluster. We don’t change any of those values in the client except spark.executor.memory (only if specified) and any escaped values.

On Oct 13, 2014, at 11:32 AM, Ted Dunning <[email protected]> wrote:

On Mon, Oct 13, 2014 at 12:32 PM, Reinis Vicups <[email protected]> wrote:

>> Do you think that simply increasing this parameter is a safe and sane
>> thing to do?
>
> Why would it be unsafe?
>
> In my own implementation I am using 400 tasks on my 4-node-2cpu cluster,
> and the execution times of the largest shuffle stage have dropped around
> 10 times.
>
> I have a number of test values back from the time when I used the "old"
> RowSimilarityJob, and with some exceptions (I guess due to randomized
> sparsification) I still have approximately the same values with my own
> row similarity implementation.

Splitting things too far can make processes much less efficient. Setting parameters like this may propagate further than desired. I asked because I don't know, however.
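For illustration, here is a minimal sketch of the parsing side of the proposed escape mechanism. Only the -D key=value syntax and the pass-through-to-SparkConf behavior come from the proposal above; the class name and parsing details are hypothetical, not the actual Mahout driver code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: collect escaped "-D key=value" Spark options from the
// CLI args into a map. In the real driver, each entry would be applied to the
// SparkConf (e.g. sparkConf.set(key, value)) before the context is created.
public class EscapedSparkOptions {
    public static Map<String, String> parse(String[] args) {
        Map<String, String> conf = new HashMap<>();
        for (int i = 0; i < args.length - 1; i++) {
            if ("-D".equals(args[i])) {
                String kv = args[++i];           // the "key=value" token
                int eq = kv.indexOf('=');
                if (eq > 0) {
                    conf.put(kv.substring(0, eq), kv.substring(eq + 1));
                }
            }
        }
        return conf;
    }

    public static void main(String[] args) {
        String[] cli = {"-D", "spark.default.parallelism=400",
                        "-D", "spark.executor.memory=4g"};
        Map<String, String> conf = parse(cli);
        System.out.println(conf.get("spark.default.parallelism")); // 400
    }
}
```

Anything not escaped with -D would be left to the cluster's own conf files, matching the behavior described above where the client only overrides spark.executor.memory and explicitly escaped values.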
