Sorry, but I haven't used Spark on EC2 and I'm not sure what the problem could be. Hopefully someone else will be able to help. The only thing I could suggest is to try setting both the worker instances and the number of cores (assuming spark-ec2 has such a parameter).
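For what it's worth, on a plain standalone cluster both knobs normally live in conf/spark-env.sh on each worker node. A rough sketch only -- I haven't tried this on a spark-ec2 cluster, and the memory figure is purely illustrative, not a recommendation:

    # conf/spark-env.sh on every worker node (standalone mode)
    export SPARK_WORKER_INSTANCES=2   # run two worker daemons per node
    export SPARK_WORKER_CORES=8       # cores EACH worker instance may offer
    export SPARK_WORKER_MEMORY=12g    # memory EACH worker instance may offer;
                                      # split the node's RAM between instances
                                      # and leave headroom for the OS/daemons

With those set, each node should advertise 2 x 8 = 16 cores to the master after the workers are restarted.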
On Thu, Jul 31, 2014 at 3:03 PM, Darin McBeath <ddmcbe...@yahoo.com> wrote:

> Ok, I set the number of Spark worker instances to 2 (below is my startup
> command). This essentially had the effect of increasing my number of
> workers from 3 to 6 (which was good), but it also reduced my number of
> cores per worker from 8 to 4 (which was not so good). In the end, I would
> still only be able to process 24 partitions concurrently. I'm starting a
> stand-alone cluster using the Spark-provided EC2 scripts. I tried setting
> the env variable SPARK_WORKER_CORES in spark_ec2.py, but this had no
> effect, so it's not clear whether SPARK_WORKER_CORES can even be set with
> the EC2 scripts. Anyway, I'm not sure there is anything else I can try,
> but I at least wanted to document what I did try and the net effect. I'm
> open to any suggestions/advice.
>
> ./spark-ec2 -k *key* -i key.pem --hadoop-major-version=2 launch -s 3 -t
> m3.2xlarge -w 3600 --spot-price=.08 -z us-east-1e --worker-instances=2
> *my-cluster*
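One untested idea for this case: rather than going through spark_ec2.py, override the core count after the cluster is up and restart the standalone daemons. This assumes the usual spark-ec2 layout (Spark installed under /root/spark on the master, with the copy-dir helper under /root/spark-ec2); adjust paths if yours differ:

    # on the master, after launch
    echo "export SPARK_WORKER_CORES=8" >> /root/spark/conf/spark-env.sh
    /root/spark-ec2/copy-dir /root/spark/conf    # sync conf/ to the slaves
    /root/spark/sbin/stop-all.sh
    /root/spark/sbin/start-all.sh

The master web UI (port 8080) should then show how many cores each worker is actually advertising.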
>
> ------------------------------
> From: Daniel Siegmann <daniel.siegm...@velos.io>
> To: Darin McBeath <ddmcbe...@yahoo.com>
> Cc: Daniel Siegmann <daniel.siegm...@velos.io>; "user@spark.apache.org" <user@spark.apache.org>
> Sent: Thursday, July 31, 2014 10:04 AM
> Subject: Re: Number of partitions and Number of concurrent tasks
>
> I haven't configured this myself. I'd start with setting SPARK_WORKER_CORES
> to a higher value, since that's a bit simpler than adding more workers.
> This defaults to "all available cores" according to the documentation, so
> I'm not sure if you can actually set it higher. If not, you can get around
> this by adding more worker instances; I believe simply setting
> SPARK_WORKER_INSTANCES to 2 would be sufficient.
>
> I don't think you *have* to set the cores if you have more workers - it
> will default to 8 cores per worker (in your case). But maybe 16 cores per
> node will be too many. You'll have to test. Keep in mind that more workers
> also means more memory and such, so you may need to tweak some other
> settings downward in this case.
>
> On a side note: I've read that some people found performance was better
> with more workers that each had less memory, instead of a single worker
> with tons of memory, because it cut down on garbage collection time. But I
> can't speak to that myself.
>
> In any case, if you increase the number of cores available in your cluster
> (whether per worker, by adding more workers per node, or of course by
> adding more nodes), you should see more tasks running concurrently. Whether
> this will actually be *faster* probably depends mainly on whether the CPUs
> in your nodes were really being fully utilized with the current number of
> cores.
>
> On Wed, Jul 30, 2014 at 8:30 PM, Darin McBeath <ddmcbe...@yahoo.com> wrote:
>
> Thanks.
>
> So to make sure I understand: since I'm using a stand-alone cluster, I
> would set SPARK_WORKER_INSTANCES to something like 2 (instead of the
> default value of 1). Is that correct? But it also sounds like I need to
> explicitly set a value for SPARK_WORKER_CORES (based on what the
> documentation states). What would I want that value to be, based on my
> configuration below? Or would I leave that alone?
>
> ------------------------------
> From: Daniel Siegmann <daniel.siegm...@velos.io>
> To: user@spark.apache.org; Darin McBeath <ddmcbe...@yahoo.com>
> Sent: Wednesday, July 30, 2014 5:58 PM
> Subject: Re: Number of partitions and Number of concurrent tasks
>
> This is correct behavior. Each "core" can execute exactly one task at a
> time, with each task corresponding to a partition. If your cluster has only
> 24 cores, you can run at most 24 tasks at once.
>
> You could run multiple workers per node to get more executors. That would
> give you more cores in the cluster. But however many cores you have, each
> core will run only one task at a time.
>
> On Wed, Jul 30, 2014 at 3:56 PM, Darin McBeath <ddmcbe...@yahoo.com> wrote:
>
> I have a cluster with 3 nodes (each with 8 cores) using Spark 1.0.1.
>
> I have an RDD<String> which I've repartitioned so it has 100 partitions
> (hoping to increase the parallelism).
>
> When I do a transformation (such as filter) on this RDD, I can't seem to
> get more than 24 tasks (my total number of cores across the 3 nodes) going
> at one point in time. By tasks, I mean the number of tasks that appear
> under the Application UI. I tried explicitly setting
> spark.default.parallelism to 48 (hoping I would get 48 tasks running
> concurrently) and verified this in the Application UI for the running
> application, but this had no effect. Perhaps this is ignored for a 'filter'
> and the default is the total number of cores available.
>
> I'm fairly new with Spark, so maybe I'm just missing or misunderstanding
> something fundamental. Any help would be appreciated.
>
> Thanks.
>
> Darin.

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
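To put numbers on the ceiling discussed above (using only the figures from this thread, not measurements): the standalone scheduler can run at most

    nodes x worker instances per node x SPARK_WORKER_CORES

tasks simultaneously, i.e. 3 x 1 x 8 = 24 with the original layout, or 3 x 2 x 8 = 48 if each node runs two workers that each advertise 8 cores. Repartitioning to 100 partitions or raising spark.default.parallelism changes how many tasks exist for a stage, not how many of them can run at the same time.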