I am using globs, though:

raw = sc.textFile("/path/to/dir/*/*")

and I have tons of files, so 1 file per partition should not be a problem.
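(For completeness, a minimal sketch of what that looks like from the PySpark shell. The path, the file counts, and the repartition factor are illustrative, not taken from the job above.)

+++++++++++++++++++++++++++++++++++++++++++++++++
# Each gzipped file matched by the glob becomes a single, non-splittable
# partition, so the partition count equals the number of matched files.
raw = sc.textFile("/path/to/dir/*/*")
print(raw.getNumPartitions())           # == number of files the glob matched

# Each of those partitions is still decompressed by one core, so if a few
# files are much larger than the rest, an explicit repartition() (a full
# shuffle) can spread the work back out across the cluster.
evened = raw.repartition(sc.defaultParallelism * 2)
+++++++++++++++++++++++++++++++++++++++++++++++++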
On Mon, Oct 20, 2014 at 7:14 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> The biggest danger with gzipped files is this:
>
> >>> raw = sc.textFile("/path/to/file.gz", 8)
> >>> raw.getNumPartitions()
> 1
>
> You think you're telling Spark to parallelize the reads on the input, but
> Spark cannot parallelize reads against gzipped files. So 1 gzipped file
> gets assigned to 1 partition.
>
> It might be a nice user hint if Spark warned when parallelism is disabled
> by the input format.
>
> Nick
>
> On Mon, Oct 20, 2014 at 6:53 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>
>> Hi Nicholas,
>>
>> Gzipping is an impressive guess! Yes, they are.
>> My data sets are too large to make repartitioning viable, but I could try
>> it on a subset.
>> I generally have many more partitions than cores.
>> This was happening before I started setting those configs.
>>
>> thanks
>> Daniel
>>
>> On Mon, Oct 20, 2014 at 5:37 PM, Nicholas Chammas
>> <nicholas.cham...@gmail.com> wrote:
>>
>>> Are you dealing with gzipped files by any chance? Does explicitly
>>> repartitioning your RDD to match the number of cores in your cluster help
>>> at all? How about if you don't specify the configs you listed and just go
>>> with defaults all around?
>>>
>>> On Mon, Oct 20, 2014 at 5:22 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>>>
>>>> I launch the cluster using the vanilla spark-ec2 scripts.
>>>> I just specify the number of slaves and the instance type.
>>>>
>>>> On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>>>>
>>>>> I usually run interactively from the spark-shell.
>>>>> My data definitely has more than enough partitions to keep all the
>>>>> workers busy.
>>>>> When I first launch the cluster I do:
>>>>>
>>>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> cat <<EOF >>~/spark/conf/spark-defaults.conf
>>>>> spark.serializer org.apache.spark.serializer.KryoSerializer
>>>>> spark.rdd.compress true
>>>>> spark.shuffle.consolidateFiles true
>>>>> spark.akka.frameSize 20
>>>>> EOF
>>>>>
>>>>> copy-dir /root/spark/conf
>>>>> spark/sbin/stop-all.sh
>>>>> sleep 5
>>>>> spark/sbin/start-all.sh
>>>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>
>>>>> before starting the spark-shell or running any jobs.
>>>>>
>>>>> On Mon, Oct 20, 2014 at 2:57 PM, Nicholas Chammas
>>>>> <nicholas.cham...@gmail.com> wrote:
>>>>>
>>>>>> Perhaps your RDD is not partitioned enough to utilize all the cores
>>>>>> in your system.
>>>>>>
>>>>>> Could you post a simple code snippet and explain what kind of
>>>>>> parallelism you are seeing for it? And can you report on how many
>>>>>> partitions your RDDs have?
>>>>>>
>>>>>> On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>>>>>>
>>>>>>> I am launching EC2 clusters using the spark-ec2 scripts.
>>>>>>> My understanding is that this configures Spark to use the available
>>>>>>> resources.
>>>>>>> I can see that Spark will use the available memory on larger instance
>>>>>>> types.
>>>>>>> However, I have never seen Spark running at more than 400% (100% on
>>>>>>> each of 4 cores) on machines with many more cores.
>>>>>>> Am I misunderstanding the docs? Is it just that high-end EC2 instances
>>>>>>> get I/O starved when running Spark? It would be strange if that
>>>>>>> consistently produced a 400% hard limit though.
>>>>>>>
>>>>>>> thanks
>>>>>>> Daniel
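(A quick way to check from the shell whether the spark-defaults.conf block quoted above was actually picked up after the restart. This is a rough sketch: sc is assumed to be the live SparkContext, and the fallback strings are just placeholders.)

+++++++++++++++++++++++++++++++++++++++++++++++++
# Read back the configuration the running application actually sees.
conf = sc.getConf()
print(conf.get("spark.serializer", "<not set>"))                # expect KryoSerializer
print(conf.get("spark.rdd.compress", "<not set>"))              # expect "true"
print(conf.get("spark.shuffle.consolidateFiles", "<not set>"))  # expect "true"
print(conf.get("spark.akka.frameSize", "<not set>"))            # expect "20"
+++++++++++++++++++++++++++++++++++++++++++++++++

If these come back unset, the shell was probably started against a conf directory that predates the edit.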