I launch the cluster using the vanilla spark-ec2 scripts; I just specify the number of slaves and the instance type.
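For reference, the launch command looks roughly like the sketch below; the key pair, identity file, slave count, instance type, and cluster name are just placeholders, not the exact values I use:

+++++++++++++++++++++++++++++++++++++++++++++++++
# hypothetical values -- substitute your own key pair, identity file,
# slave count, instance type, and cluster name
./spark-ec2 -k my-keypair -i ~/my-keypair.pem \
    -s 8 -t c3.4xlarge \
    launch my-cluster
+++++++++++++++++++++++++++++++++++++++++++++++++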
On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler <dmah...@gmail.com> wrote:

> I usually run interactively from the spark-shell.
> My data definitely has more than enough partitions to keep all the workers
> busy.
> When I first launch the cluster I do:
>
> +++++++++++++++++++++++++++++++++++++++++++++++++
> cat <<EOF >>~/spark/conf/spark-defaults.conf
> spark.serializer org.apache.spark.serializer.KryoSerializer
> spark.rdd.compress true
> spark.shuffle.consolidateFiles true
> spark.akka.frameSize 20
> EOF
>
> copy-dir /root/spark/conf
> spark/sbin/stop-all.sh
> sleep 5
> spark/sbin/start-all.sh
> +++++++++++++++++++++++++++++++++++++++++++++++++
>
> before starting the spark-shell or running any jobs.
>
> On Mon, Oct 20, 2014 at 2:57 PM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
>
>> Perhaps your RDD is not partitioned enough to utilize all the cores in
>> your system.
>>
>> Could you post a simple code snippet and explain what kind of parallelism
>> you are seeing for it? And can you report on how many partitions your RDDs
>> have?
>>
>> On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>>
>>> I am launching EC2 clusters using the spark-ec2 scripts.
>>> My understanding is that this configures Spark to use the available
>>> resources.
>>> I can see that Spark will use the available memory on larger instance
>>> types.
>>> However, I have never seen Spark running at more than 400% (using 100%
>>> on 4 cores) on machines with many more cores.
>>> Am I misunderstanding the docs? Is it just that high-end EC2 instances
>>> get I/O starved when running Spark? It would be strange if that
>>> consistently produced a 400% hard limit, though.
>>>
>>> thanks
>>> Daniel
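For what it's worth, this is roughly how the partition count and default parallelism can be checked at the spark-shell prompt; the input path and the RDD names below are just placeholders, not my actual data:

+++++++++++++++++++++++++++++++++++++++++++++++++
// hypothetical input path -- substitute your own data
val lines = sc.textFile("s3n://my-bucket/my-data/*")

println("default parallelism: " + sc.defaultParallelism)
println("partitions: " + lines.partitions.length)

// if the partition count is much lower than the total core count,
// repartition (e.g. to a small multiple of defaultParallelism) before
// the heavy transformations so the work spreads across all cores
val repartitioned = lines.repartition(sc.defaultParallelism * 3)
+++++++++++++++++++++++++++++++++++++++++++++++++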