I’ll add a new option to escape arbitrary Spark options and put them directly into 
the SparkConf for the job before the context is created.

The CLI will be something like -D xxx=yyy, so in this case you can change the 
default parallelism with:

-D spark.default.parallelism=400

If the rule of thumb holds that parallelism should often be 8 to 16x your number 
of cores, then running locally on my laptop with local[7] I should use -D 
spark.default.parallelism=56 or 112.
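
That sizing is quick arithmetic to check (the 8–16x multiplier is a heuristic from this thread, not a Spark default):

```scala
// Rule-of-thumb sketch: parallelism ≈ 8x to 16x the total core count.
// "cores" here models local[7] on a laptop; adjust for your own cluster.
val cores = 7
val low   = cores * 8   // conservative end: 56 tasks
val high  = cores * 16  // aggressive end: 112 tasks
println(s"-D spark.default.parallelism=$low or $high")
```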

If you want this value set for your entire cluster, you should be able to set it 
in the conf files when you launch the cluster. We don’t change any of those 
values in the client except spark.executor.memory (only if specified) and any 
escaped values. 
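
A minimal sketch of how the escaped options might be collected; parseSparkOpts is a hypothetical helper (the -D flag follows the proposal above, not an existing Mahout option), and in the real client each pair would be applied with conf.set(key, value) on the job’s SparkConf before the context is created:

```scala
// Hypothetical helper: collect escaped "-D key=value" pairs from the
// CLI argument list into Spark config entries.
def parseSparkOpts(args: Seq[String]): Map[String, String] =
  args.sliding(2).collect {
    case Seq("-D", kv) if kv.contains("=") =>
      val Array(k, v) = kv.split("=", 2)
      k -> v
  }.toMap

// Each pair would then be set on the SparkConf before creating the context:
//   opts.foreach { case (k, v) => conf.set(k, v) }
val opts = parseSparkOpts(Seq("-D", "spark.default.parallelism=400"))
```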

On Oct 13, 2014, at 11:32 AM, Ted Dunning <[email protected]> wrote:

On Mon, Oct 13, 2014 at 12:32 PM, Reinis Vicups <[email protected]> wrote:

> > Do you think that simply increasing this parameter is a safe and sane
> > thing to do?
> 
> Why would it be unsafe?
> 
> In my own implementation I am using 400 tasks on my 4-node-2cpu cluster,
> and the execution times of the largest shuffle stage have dropped around
> 10 times. I have a number of test values back from the time when I used
> the "old" RowSimilarityJob, and with some exceptions (I guess due to
> randomized sparsification) I still get approximately the same values with
> my own row similarity implementation.
> 

Splitting things too far can make processes much less efficient.  Setting
parameters like this may propagate further than desired.

I asked because I don't know, however.
