I haven't configured this myself. I'd start by setting SPARK_WORKER_CORES (in conf/spark-env.sh) to a higher value, since that's a bit simpler than adding more workers. It defaults to "all available cores" according to the documentation, so I'm not sure whether you can actually set it higher than that. If not, you can get around this by adding more worker instances; I believe simply setting SPARK_WORKER_INSTANCES to 2 would be sufficient.
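For reference, here's roughly what I'd expect that to look like in conf/spark-env.sh on each worker node. This is untested on my end, and the specific numbers are just examples you'd size to your own machines:

    # conf/spark-env.sh -- sketch only, adjust values to your hardware
    export SPARK_WORKER_INSTANCES=2    # run two workers per node instead of one
    # export SPARK_WORKER_CORES=4      # optional: cap cores per worker if 2 x 8 = 16 per node is too many
    # export SPARK_WORKER_MEMORY=12g   # optional: memory is per worker, so you may want to lower it

You'd then restart the workers (e.g. sbin/stop-all.sh followed by sbin/start-all.sh on the master) for the new settings to take effect.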
I don't think you *have* to set the cores if you have more workers - each worker will default to all 8 cores on its node (in your case). But maybe 16 cores per node will be too many. You'll have to test. Keep in mind that more workers means more memory in use as well (the worker memory setting applies per worker), so you may need to tune some other settings downward in this case.

On a side note: I've read that some people found performance was better with more workers that each had less memory, instead of a single worker with tons of memory, because it cut down on garbage collection time. But I can't speak to that myself.

In any case, if you increase the number of cores available in your cluster (whether per worker, by adding more workers per node, or of course by adding more nodes), you should see more tasks running concurrently. Whether this will actually be *faster* probably depends mainly on whether the CPUs in your nodes were really being fully utilized with the current number of cores.

On Wed, Jul 30, 2014 at 8:30 PM, Darin McBeath <ddmcbe...@yahoo.com> wrote:

> Thanks.
>
> So to make sure I understand: since I'm using a 'stand-alone' cluster, I
> would set SPARK_WORKER_INSTANCES to something like 2 (instead of the
> default value of 1). Is that correct? But it also sounds like I need to
> explicitly set a value for SPARK_WORKER_CORES (based on what the
> documentation states). What would I want that value to be based on my
> configuration below? Or would I leave that alone?
>
> ------------------------------
> *From:* Daniel Siegmann <daniel.siegm...@velos.io>
> *To:* user@spark.apache.org; Darin McBeath <ddmcbe...@yahoo.com>
> *Sent:* Wednesday, July 30, 2014 5:58 PM
> *Subject:* Re: Number of partitions and Number of concurrent tasks
>
> This is correct behavior. Each "core" can execute exactly one task at a
> time, with each task corresponding to a partition. If your cluster only
> has 24 cores, you can only run at most 24 tasks at once.
>
> You could run multiple workers per node to get more executors. That would
> give you more cores in the cluster. But however many cores you have, each
> core will run only one task at a time.
>
>
> On Wed, Jul 30, 2014 at 3:56 PM, Darin McBeath <ddmcbe...@yahoo.com>
> wrote:
>
> I have a cluster with 3 nodes (each with 8 cores) using Spark 1.0.1.
>
> I have an RDD<String> which I've repartitioned so it has 100 partitions
> (hoping to increase the parallelism).
>
> When I do a transformation (such as filter) on this RDD, I can't seem to
> get more than 24 tasks (my total number of cores across the 3 nodes) going
> at one point in time. By tasks, I mean the number of tasks that appear
> under the Application UI. I tried explicitly setting
> spark.default.parallelism to 48 (hoping I would get 48 tasks concurrently
> running) and verified this in the Application UI for the running
> application, but this had no effect. Perhaps this is ignored for a
> 'filter' and the default is the total number of cores available.
>
> I'm fairly new with Spark so maybe I'm just missing or misunderstanding
> something fundamental. Any help would be appreciated.
>
> Thanks.
>
> Darin.
>
>
> --
> Daniel Siegmann, Software Developer
> Velos
> Accelerating Machine Learning
>
> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
> E: daniel.siegm...@velos.io W: www.velos.io

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io