I usually run interactively from the spark-shell. My data definitely has more than enough partitions to keep all the workers busy. When I first launch the cluster, I do:
+++++++++++++++++++++++++++++++++++++++++++++++++
cat <<EOF >>~/spark/conf/spark-defaults.conf
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.rdd.compress true
spark.shuffle.consolidateFiles true
spark.akka.frameSize 20
EOF
copy-dir /root/spark/conf
spark/sbin/stop-all.sh
sleep 5
spark/sbin/start-all.sh
+++++++++++++++++++++++++++++++++++++++++++++++++

before starting the spark-shell or running any jobs.

On Mon, Oct 20, 2014 at 2:57 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> Perhaps your RDD is not partitioned enough to utilize all the cores in
> your system.
>
> Could you post a simple code snippet and explain what kind of parallelism
> you are seeing for it? And can you report on how many partitions your
> RDDs have?
>
> On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>
>> I am launching EC2 clusters using the spark-ec2 scripts.
>> My understanding is that this configures Spark to use the available
>> resources.
>> I can see that Spark will use the available memory on larger instance
>> types.
>> However, I have never seen Spark running at more than 400% (using 100%
>> on 4 cores) on machines with many more cores.
>> Am I misunderstanding the docs? Is it just that high-end EC2 instances
>> get I/O starved when running Spark? It would be strange if that
>> consistently produced a 400% hard limit, though.
>>
>> thanks
>> Daniel
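
To confirm the partition counts I am seeing, I run something like the following from the spark-shell (the input path and RDD name here are just placeholders, not my actual job):

+++++++++++++++++++++++++++++++++++++++++++++++++
// Sketch of the kind of check I run interactively; the S3 path is a placeholder.
val data = sc.textFile("s3n://my-bucket/some-input")

// Partition count of the RDD vs. the parallelism Spark reports for the cluster.
println(s"partitions: ${data.partitions.length}")
println(s"default parallelism: ${sc.defaultParallelism}")

// If the partition count were low relative to the total cores,
// I would repartition before the expensive stages, e.g.:
// val wider = data.repartition(sc.defaultParallelism * 3)
+++++++++++++++++++++++++++++++++++++++++++++++++

The partition count always comes out well above the total number of cores, which is why the 400% ceiling is puzzling me.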