The biggest danger with gzipped files is this:

>>> raw = sc.textFile("/path/to/file.gz", 8)
>>> raw.getNumPartitions()
1
You think you're telling Spark to parallelize the reads on the input, but Spark cannot parallelize reads against gzipped files. So 1 gzipped file gets assigned to 1 partition. It might be a nice user hint if Spark warned when parallelism is disabled by the input format. (A minimal sketch of the usual workaround, repartitioning right after the read, is appended after the quoted thread below.)

Nick

On Mon, Oct 20, 2014 at 6:53 PM, Daniel Mahler <dmah...@gmail.com> wrote:

> Hi Nicholas,
>
> Gzipping is an impressive guess! Yes, they are.
> My data sets are too large to make repartitioning viable, but I could try it on a subset.
> I generally have many more partitions than cores.
> This was happening before I started setting those configs.
>
> thanks
> Daniel
>
> On Mon, Oct 20, 2014 at 5:37 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
>> Are you dealing with gzipped files by any chance? Does explicitly repartitioning your RDD to match the number of cores in your cluster help at all? How about if you don't specify the configs you listed and just go with defaults all around?
>>
>> On Mon, Oct 20, 2014 at 5:22 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>>
>>> I launch the cluster using the vanilla spark-ec2 scripts.
>>> I just specify the number of slaves and the instance type.
>>>
>>> On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>>>
>>>> I usually run interactively from the spark-shell.
>>>> My data definitely has more than enough partitions to keep all the workers busy.
>>>> When I first launch the cluster I do:
>>>>
>>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>> cat <<EOF >>~/spark/conf/spark-defaults.conf
>>>> spark.serializer org.apache.spark.serializer.KryoSerializer
>>>> spark.rdd.compress true
>>>> spark.shuffle.consolidateFiles true
>>>> spark.akka.frameSize 20
>>>> EOF
>>>>
>>>> copy-dir /root/spark/conf
>>>> spark/sbin/stop-all.sh
>>>> sleep 5
>>>> spark/sbin/start-all.sh
>>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>
>>>> before starting the spark-shell or running any jobs.
>>>>
>>>> On Mon, Oct 20, 2014 at 2:57 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>>
>>>>> Perhaps your RDD is not partitioned enough to utilize all the cores in your system.
>>>>>
>>>>> Could you post a simple code snippet and explain what kind of parallelism you are seeing for it? And can you report on how many partitions your RDDs have?
>>>>>
>>>>> On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>>>>>
>>>>>> I am launching EC2 clusters using the spark-ec2 scripts.
>>>>>> My understanding is that this configures Spark to use the available resources.
>>>>>> I can see that Spark will use the available memory on larger instance types.
>>>>>> However, I have never seen Spark running at more than 400% (using 100% on 4 cores) on machines with many more cores.
>>>>>> Am I misunderstanding the docs? Is it just that high-end EC2 instances get I/O starved when running Spark? It would be strange if that consistently produced a 400% hard limit, though.
>>>>>>
>>>>>> thanks
>>>>>> Daniel
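For reference, a minimal PySpark sketch of the workaround discussed at the top of the thread: check the partition count after reading a gzipped file, then repartition before doing heavy work. This is only a sketch, not the thread's exact code; the path is hypothetical, sc is assumed to be the SparkContext provided by the pyspark shell, and the repartition target and the example action are illustrative. Decompressing the data or splitting it across many smaller .gz files would avoid the extra shuffle entirely.

# Hedged sketch: recovering parallelism after reading a single gzipped file.
raw = sc.textFile("/path/to/file.gz", 8)        # the "8" is only a minPartitions hint
print(raw.getNumPartitions())                   # prints 1: a lone .gz file is not splittable

# Pay for one shuffle up front so later stages can use every core.
fixed = raw.repartition(sc.defaultParallelism)
print(fixed.getNumPartitions())                 # roughly the total cores in the cluster

# Subsequent transformations and actions now run with full parallelism.
total_chars = fixed.map(len).sum()
print(total_chars)

On large inputs the shuffle itself can be expensive (as Daniel notes above), so repartitioning is worth testing on a subset first.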