How big is your file? It's probably large enough that the Hadoop InputFormat makes 52 splits for it. The data drives the partitioning, not your processing resources. Note also that the second argument to sc.textFile is only a *minimum* number of partitions, which is why asking for 8 still leaves you with 52. Really, 8 partitions is the minimum parallelism you want; several times your number of cores is better.
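For example, a minimal sketch in the spark-shell (the file path here is a placeholder; coalesce is the stock RDD method for shrinking the partition count after loading):

    val data = sc.textFile("/path/to/file")  // partition count follows the Hadoop splits
    data.partitions.size                     // e.g. 52 here

    // The second argument is only a minimum, so this still gives 52:
    sc.textFile("/path/to/file", 8).partitions.size

    // To actually end up with fewer partitions, coalesce after loading:
    val fewer = data.coalesce(8)
    fewer.partitions.size                    // 8
    fewer.count()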
On Fri, Dec 5, 2014 at 8:51 AM, Jaonary Rabarisoa <[email protected]> wrote:
> Hi all,
>
> I'm trying to run a spark job with spark-shell. What I want to do is just
> to count the number of lines in a file.
> I start the spark-shell with the default arguments, i.e. just with
> ./bin/spark-shell.
>
> I load the text file with sc.textFile("path") and then call count on my data.
>
> When I do this, my data is always split into 52 partitions. I don't understand
> why, since I run it on a local machine with 8 cores and
> sc.defaultParallelism gives me 8.
>
> Even if I load the file with sc.textFile("path", 8), I always get
> data.partitions.size = 52.
>
> I use spark 1.1.1.
>
> Any ideas?
>
> Cheers,
>
> Jao
