Thanks Ayan! Finally it worked!! Thanks a lot everyone for the inputs!
Once I prefixed the params with "spark.hadoop", I see the no. of tasks getting reduced. I'm setting the following params:

--conf spark.hadoop.dfs.block.size
--conf spark.hadoop.mapreduce.input.fileinputformat.split.minsize
--conf spark.hadoop.mapreduce.input.fileinputformat.split.maxsize

(A short sketch with example values is at the bottom of this message, below the quoted thread.)

On Tue, Oct 10, 2017 at 1:16 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> Maybe you need to set the parameters for the mapreduce API and not the
> mapred API. I do not have in mind now how they differ, but the Hadoop web
> page should tell you ;-)
>
> On 10. Oct 2017, at 17:53, Kanagha Kumar <kpra...@salesforce.com> wrote:
>
> Thanks for the inputs!!
>
> I passed in spark.mapred.max.split.size and spark.mapred.min.split.size set
> to the size I wanted to read. It didn't take any effect.
> I also tried passing in spark.dfs.block.size, with all the params set to
> the same value.
>
> JavaSparkContext.fromSparkContext(spark.sparkContext()).textFile(hdfsPath, 13);
>
> Is there any other param that needs to be set as well?
>
> Thanks
>
> On Tue, Oct 10, 2017 at 4:32 AM, ayan guha <guha.a...@gmail.com> wrote:
>
>> I have not tested this, but you should be able to pass any map-reduce-like
>> conf to the underlying Hadoop config... essentially you should be able to
>> control split behaviour just as you would in a map-reduce program (since
>> Spark uses the same input format).
>>
>> On Tue, Oct 10, 2017 at 10:21 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>>> Write your own input format/datasource or split the file yourself
>>> beforehand (not recommended).
>>>
>>> On 10. Oct 2017, at 09:14, Kanagha Kumar <kpra...@salesforce.com> wrote:
>>>
>>> Hi,
>>>
>>> I'm trying to read a 60GB HDFS file using Spark's
>>> textFile("hdfs_file_path", minPartitions).
>>>
>>> How can I control the no. of tasks by increasing the split size? With the
>>> default split size of 250 MB, several tasks are created. But I would like
>>> a specific no. of tasks to be created while reading from HDFS itself,
>>> instead of using repartition() etc.
>>>
>>> Any suggestions are helpful!
>>>
>>> Thanks
>>
>> --
>> Best Regards,
>> Ayan Guha
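
P.S. For anyone who finds this thread later, here is a rough, untested sketch of setting the same properties from code instead of spark-submit. Spark copies any "spark.hadoop.*" property into the Hadoop Configuration with the prefix stripped, which is why the prefix matters. The 256 MB value, the class name, and the HDFS path are only examples, not the actual values I used.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadWithLargeSplits {
    public static void main(String[] args) {
        // Any "spark.hadoop.*" property is copied (prefix stripped) into the
        // Hadoop Configuration that textFile()'s input format sees.
        // 268435456 (256 MB) is just an example target split size.
        SparkConf conf = new SparkConf()
                .setAppName("large-split-read")
                .set("spark.hadoop.dfs.block.size", "268435456")
                .set("spark.hadoop.mapreduce.input.fileinputformat.split.minsize", "268435456")
                .set("spark.hadoop.mapreduce.input.fileinputformat.split.maxsize", "268435456");

        JavaSparkContext jsc = new JavaSparkContext(conf);

        // minPartitions is only a hint; the split sizes above are what
        // actually drive how many input splits (tasks) get created.
        // The path below is a placeholder.
        JavaRDD<String> lines = jsc.textFile("hdfs:///data/big_file", 13);
        System.out.println("partitions = " + lines.getNumPartitions());

        jsc.stop();
    }
}

The same thing can be done on the command line with the three --conf flags listed at the top of this message; either way, the min/max split sizes are what determine how many input splits (and therefore read tasks) textFile() creates.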