In Spark Streaming, StreamingContext.fileStream gives a FileInputDStream. Within each batch interval, it launches map tasks for the new files detected during that interval. It appears that Spark computes the number of map tasks based on the block size of the files.
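Here is roughly what I have (a minimal sketch; the application name, directory path, and batch interval are placeholders):

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileStreamParallelism {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FileStreamParallelism")
    // 60-second batch interval: files that appear in the monitored
    // directory during each interval become one RDD per batch.
    val ssc = new StreamingContext(conf, Seconds(60))

    // The directory is a placeholder.
    val lines = ssc
      .fileStream[LongWritable, Text, TextInputFormat]("hdfs:///data/incoming")
      .map { case (_, text) => text.toString }

    lines.foreachRDD { rdd =>
      // The partition count here comes from the files' block layout;
      // fileStream does not appear to take a minPartitions-style argument
      // that would change it.
      println(s"partitions = ${rdd.partitions.length}, records = ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The per-batch partition count printed above is what I am trying to increase.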
Below is a quote from the Spark documentation:

"Spark automatically sets the number of “map” tasks to run on each file according to its size (though you can control it through optional parameters to SparkContext.textFile, etc.)"

In my testing, when the input files are stored as 512 MB blocks, each map task processes a 512 MB chunk of data, no matter what value I set for dfs.blocksize on the driver/executors. Is there a way to increase parallelism, say, have each map task read 128 MB of data so that more map tasks run, the way the optional minPartitions parameter to SparkContext.textFile does for a batch job (see the sketch at the end of this message)?

-- Chen Song
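P.S. For comparison, this is the kind of control I mean on the batch side (sketch only; the path and partition count are made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TextFileParallelism {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TextFileParallelism"))

    val path = "hdfs:///data/archive/*.log"  // placeholder path

    // Default: the split count follows the files' block layout
    // (512 MB blocks end up as 512 MB splits).
    val defaultSplits = sc.textFile(path)

    // With the optional minPartitions hint, the same files are read as
    // more, smaller splits, i.e. more map tasks.
    val moreSplits = sc.textFile(path, 32)

    println(s"default partitions:      ${defaultSplits.partitions.length}")
    println(s"with minPartitions = 32: ${moreSplits.partitions.length}")

    sc.stop()
  }
}
```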