Hi Nicholas,

Gzipping is an impressive guess! Yes, the files are gzipped. My data sets are too large to make repartitioning viable, but I could try it on a subset. I generally have many more partitions than cores. This was happening before I started setting those configs.
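For the subset experiment, this is roughly what I have in mind in the spark-shell (just a sketch; the S3 path and the repartition multiplier are placeholders, not my real values):

+++++++++++++++++++++++++++++++++++++++++++++++++
// Gzipped files are not splittable, so each .gz file comes in as a
// single partition no matter how large it is.
val raw = sc.textFile("s3n://some-bucket/some-prefix/*.gz")  // placeholder path
println(raw.partitions.size)             // == number of .gz files read

// Spread the data over the cluster before doing any heavy work.
// sc.defaultParallelism is roughly the total core count spark-ec2 gives us.
val spread = raw.repartition(sc.defaultParallelism * 3)      // multiplier is a guess
println(spread.partitions.size)
+++++++++++++++++++++++++++++++++++++++++++++++++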
thanks
Daniel

On Mon, Oct 20, 2014 at 5:37 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> Are you dealing with gzipped files by any chance? Does explicitly
> repartitioning your RDD to match the number of cores in your cluster help
> at all? How about if you don't specify the configs you listed and just go
> with defaults all around?
>
> On Mon, Oct 20, 2014 at 5:22 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>
>> I launch the cluster using the vanilla spark-ec2 scripts.
>> I just specify the number of slaves and the instance type.
>>
>> On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>>
>>> I usually run interactively from the spark-shell.
>>> My data definitely has more than enough partitions to keep all the
>>> workers busy.
>>> When I first launch the cluster, I do:
>>>
>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>> cat <<EOF >>~/spark/conf/spark-defaults.conf
>>> spark.serializer org.apache.spark.serializer.KryoSerializer
>>> spark.rdd.compress true
>>> spark.shuffle.consolidateFiles true
>>> spark.akka.frameSize 20
>>> EOF
>>>
>>> copy-dir /root/spark/conf
>>> spark/sbin/stop-all.sh
>>> sleep 5
>>> spark/sbin/start-all.sh
>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>
>>> before starting the spark-shell or running any jobs.
>>>
>>> On Mon, Oct 20, 2014 at 2:57 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>
>>>> Perhaps your RDD is not partitioned enough to utilize all the cores in
>>>> your system.
>>>>
>>>> Could you post a simple code snippet and explain what kind of
>>>> parallelism you are seeing for it? And can you report on how many
>>>> partitions your RDDs have?
>>>>
>>>> On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>>>>
>>>>> I am launching EC2 clusters using the spark-ec2 scripts.
>>>>> My understanding is that this configures Spark to use the available
>>>>> resources.
>>>>> I can see that Spark will use the available memory on larger instance
>>>>> types.
>>>>> However, I have never seen Spark running at more than 400% (using
>>>>> 100% on 4 cores) on machines with many more cores.
>>>>> Am I misunderstanding the docs? Is it just that high-end EC2 instances
>>>>> get I/O starved when running Spark? It would be strange if that
>>>>> consistently produced a 400% hard limit, though.
>>>>>
>>>>> thanks
>>>>> Daniel
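P.S. To get at the earlier question about what parallelism I actually see, I can run a purely CPU-bound job from the spark-shell and watch the cores; roughly this sketch (the element count and the slices multiplier are arbitrary):

+++++++++++++++++++++++++++++++++++++++++++++++++
// CPU-bound Monte Carlo pi estimate that never touches the input data.
// If this saturates the cores, the 400% ceiling is specific to reading
// the gzipped input rather than to the cluster configuration.
val slices = sc.defaultParallelism * 4   // a few tasks per core
val n = 100000000
val inside = sc.parallelize(1 to n, slices).map { _ =>
  val x = math.random
  val y = math.random
  if (x * x + y * y < 1) 1L else 0L
}.reduce(_ + _)
println(s"pi ~= ${4.0 * inside / n}")
+++++++++++++++++++++++++++++++++++++++++++++++++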