Sorry, but I haven't used Spark on EC2 and I'm not sure what the problem
could be. Hopefully someone else will be able to help. The only thing I
could suggest is to try setting both the worker instances and the number of
cores (assuming spark-ec2 has such a parameter).
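
On a plain standalone cluster, both of those are normally set in
conf/spark-env.sh on each worker node. A minimal sketch of what that might
look like (values are only illustrative for an 8-core node; I haven't
verified this on EC2):

  export SPARK_WORKER_INSTANCES=2   # run two workers per node
  export SPARK_WORKER_CORES=8       # cores each worker offers to executors
  export SPARK_WORKER_MEMORY=12g    # memory each worker offers to executors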


On Thu, Jul 31, 2014 at 3:03 PM, Darin McBeath <ddmcbe...@yahoo.com> wrote:

> OK, I set the number of Spark worker instances to 2 (below is my startup
> command).  This increased my number of workers from 3 to 6 (which was good),
> but it also reduced my number of cores per worker from 8 to 4 (which was not
> so good).  In the end, I would still only be able to process 24 partitions
> in parallel.  I'm starting a stand-alone cluster using the Spark-provided
> EC2 scripts.  I tried setting the SPARK_WORKER_CORES environment variable in
> spark_ec2.py, but this had no effect, so it's not clear whether I can even
> set SPARK_WORKER_CORES with the EC2 scripts.  Anyway, I'm not sure there is
> anything else I can try, but I at least wanted to document what I did try
> and the net effect.  I'm open to any suggestions/advice.
>
>  ./spark-ec2 -k *key* -i key.pem --hadoop-major-version=2 launch -s 3 -t
> m3.2xlarge -w 3600 --spot-price=.08 -z us-east-1e --worker-instances=2
> *my-cluster*
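>
> One thing I have not yet tried (just a sketch, assuming the usual spark-ec2
> layout with Spark under /root/spark and the helper scripts under
> /root/spark-ec2) is setting the value in the cluster's spark-env.sh after
> launch, pushing the conf to the slaves, and restarting the workers:
>
>   # run on the master; paths assume the standard spark-ec2 layout
>   echo "export SPARK_WORKER_CORES=8" >> /root/spark/conf/spark-env.sh
>   /root/spark-ec2/copy-dir /root/spark/conf   # rsync conf/ to the slaves
>   /root/spark/sbin/stop-all.sh
>   /root/spark/sbin/start-all.sh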
>
>
>   ------------------------------
>  *From:* Daniel Siegmann <daniel.siegm...@velos.io>
> *To:* Darin McBeath <ddmcbe...@yahoo.com>
> *Cc:* Daniel Siegmann <daniel.siegm...@velos.io>; "user@spark.apache.org"
> <user@spark.apache.org>
> *Sent:* Thursday, July 31, 2014 10:04 AM
>
> *Subject:* Re: Number of partitions and Number of concurrent tasks
>
> I haven't configured this myself. I'd start with setting
> SPARK_WORKER_CORES to a higher value, since that's a bit simpler than
> adding more workers. This defaults to "all available cores" according to
> the documentation, so I'm not sure if you can actually set it higher. If
> not, you can get around this by adding more worker instances; I believe
> simply setting SPARK_WORKER_INSTANCES to 2 would be sufficient.
>
> I don't think you *have* to set the cores if you have more workers; it
> will default to 8 cores per worker (in your case). But maybe 16 cores per
> node will be too many. You'll have to test. Keep in mind that each worker
> also claims its own memory, so with more workers per node you may need to
> tweak settings such as SPARK_WORKER_MEMORY downward.
>
> On a side note: I've read some people found performance was better when
> they had more workers with less memory each, instead of a single worker
> with tons of memory, because it cut down on garbage collection time. But I
> can't speak to that myself.
>
> In any case, if you increase the number of cores available in your cluster
> (whether per worker, or adding more workers per node, or of course adding
> more nodes) you should see more tasks running concurrently. Whether this
> will actually be *faster* probably depends mainly on whether the CPUs in
> your nodes were really being fully utilized with the current number of
> cores.
>
>
> On Wed, Jul 30, 2014 at 8:30 PM, Darin McBeath <ddmcbe...@yahoo.com>
> wrote:
>
> Thanks.
>
> So, to make sure I understand: since I'm using a 'stand-alone' cluster, I
> would set SPARK_WORKER_INSTANCES to something like 2 (instead of the
> default value of 1).  Is that correct?  But it also sounds like I need to
> explicitly set a value for SPARK_WORKER_CORES (based on what the
> documentation states).  What would I want that value to be, given my
> configuration below?  Or would I leave that alone?
>
>   ------------------------------
>  *From:* Daniel Siegmann <daniel.siegm...@velos.io>
> *To:* user@spark.apache.org; Darin McBeath <ddmcbe...@yahoo.com>
> *Sent:* Wednesday, July 30, 2014 5:58 PM
> *Subject:* Re: Number of partitions and Number of concurrent tasks
>
> This is correct behavior. Each "core" can execute exactly one task at a
> time, with each task corresponding to a partition. If your cluster has only
> 24 cores, you can run at most 24 tasks at once.
>
> You could run multiple workers per node to get more executors. That would
> give you more cores in the cluster. But however many cores you have, each
> core will run only one task at a time.
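>
> To make that concrete with the numbers below: 3 nodes x 8 cores = 24 cores,
> so at most 24 of the 100 partitions are processed at any given moment, and
> the stage completes in roughly ceil(100 / 24) = 5 waves of tasks.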
>
>
> On Wed, Jul 30, 2014 at 3:56 PM, Darin McBeath <ddmcbe...@yahoo.com>
> wrote:
>
>  I have a cluster with 3 nodes (each with 8 cores) using Spark 1.0.1.
>
> I have an RDD<String> which I've repartitioned so it has 100 partitions
> (hoping to increase the parallelism).
>
> When I do a transformation (such as filter) on this RDD, I can't seem to
> get more than 24 tasks (my total number of cores across the 3 nodes) running
> at any one time.  By tasks, I mean the number of tasks that appear under the
> Application UI.  I tried explicitly setting spark.default.parallelism to 48
> (hoping I would get 48 tasks running concurrently) and verified the setting
> in the Application UI for the running application, but this had no effect.
> Perhaps this is ignored for a 'filter' and the default is the total number
> of cores available.
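>
> Roughly what I'm doing (sketched here in Scala for brevity; the input path
> and the filter predicate are just placeholders):
>
>   val lines = sc.textFile("s3n://my-bucket/input")  // placeholder input path
>   val repartitioned = lines.repartition(100)
>   println(repartitioned.partitions.size)            // reports 100 partitions
>   val filtered = repartitioned.filter(_.contains("foo"))
>   filtered.count()  // yet only ~24 tasks (one per core) ever run at once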
>
> I'm fairly new to Spark, so maybe I'm just missing or misunderstanding
> something fundamental.  Any help would be appreciated.
>
> Thanks.
>
> Darin.
>
>
>


-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
