sc.textFile will use the Hadoop TextInputFormat (I believe), which splits the input according to the HDFS block size when reading records from HDFS. Most likely the block size is 128MB, so your 258.2GB spread over 500 part files works out to roughly 4-5 splits per file, which is where the ~2290 tasks come from. The second argument to textFile is only a suggested minimum number of partitions, which is why passing 500 didn't bring the number down. Not sure you can do anything about the number of tasks generated to read from HDFS itself, but you can coalesce afterwards if you just want fewer partitions; see the sketch just below the signature.

-------------------------------------------------------------------------------
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action
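Just to illustrate, a minimal sketch (not tested against your data; the path is a made-up stand-in for your <HDFS File>, and I'm assuming a spark-shell session where sc is the SparkContext):

val inputFile = "hdfs:///path/to/input/part-*"   // hypothetical path, substitute your own

// The second argument is only a lower bound, so the read still yields one
// partition per input split (~2290 in your case), not 500.
val inputRdd = sc.textFile(inputFile, 500)
println(inputRdd.getNumPartitions)

// coalesce (without a shuffle) merges those splits into fewer partitions;
// each of the 500 tasks then reads several HDFS blocks, at some cost to locality.
val coalesced = inputRdd.coalesce(500)
println(coalesced.getNumPartitions)   // 500
coalesced.count()

Raising mapreduce.input.fileinputformat.split.minsize on sc.hadoopConfiguration may also give you fewer, larger splits at read time, but splits never cross file boundaries, so with 500 part files you would still see at least 500 tasks.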
> On 25 Jul 2017, at 13:21, Gokula Krishnan D <email2...@gmail.com> wrote:
>
> In addition to that, I tried to read the same file with 3000 partitions, but it used 3070 partitions and took more time than before; please refer to the attachment.
>
> Thanks & Regards,
> Gokula Krishnan (Gokul)
>
> On Tue, Jul 25, 2017 at 8:15 AM, Gokula Krishnan D <email2...@gmail.com> wrote:
>
> Hello All,
>
> I have an HDFS file with approx. 1.5 billion records in 500 part files (258.2GB in size), and when I executed the following I could see that it used 2290 tasks, but it was supposed to be 500, the same as the number of HDFS part files, wasn't it?
>
> val inputFile = <HDFS File>
> val inputRdd = sc.textFile(inputFile)
> inputRdd.count()
>
> I was hoping that I could do the same with fewer partitions, so I tried the following:
>
> val inputFile = <HDFS File>
> val inputRddNew = sc.textFile(inputFile, 500)
> inputRddNew.count()
>
> But it still used 2290 tasks.
>
> As per the Scala doc, it is supposed to use the same number of partitions as the HDFS file, i.e. 500.
>
> It would be great if you could throw some insight on this.
>
> Thanks & Regards,
> Gokula Krishnan (Gokul)
>
> <Screen Shot 2017-07-25 at 8.20.58 AM.png>