Re: increase parallelism of reading from hdfs

2014-08-11 Thread Chen Song
> could probably set the system property
> "mapreduce.input.fileinputformat.split.maxsize".
>
> Regards,
> Paul Hamilton
>
> From: Chen Song
> Date: Friday, August 8, 2014 at 9:13 PM
> To: "user@spark.apache.org"
> Subject: increase parallelism of reading from hdfs
>
> In Spark Streaming, ...
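For readers following along, here is a minimal sketch of how Paul's suggestion could be applied in a Spark Streaming driver. The property name comes from the thread; the app name, batch interval, and 64 MB cap are illustrative assumptions, and whether the smaller splits are honored depends on the FileInputFormat backing the stream.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical driver setup; app name and batch interval are placeholders.
val conf = new SparkConf().setAppName("hdfs-read-parallelism")
val ssc = new StreamingContext(conf, Seconds(30))

// FileInputFormat-based readers consult this Hadoop property when computing
// input splits: a smaller max split size yields more splits per file, and
// therefore more map tasks. The 64 MB value is illustrative, not advice.
ssc.sparkContext.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.split.maxsize",
  (64 * 1024 * 1024).toString)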

Re: increase parallelism of reading from hdfs

2014-08-11 Thread Paul Hamilton
To: "user@spark.apache.org" Subject: increase parallelism of reading from hdfs In Spark Streaming, StreamContext.fileStream gives a FileInputDStream. Within each batch interval, it would launch map tasks for the new files detected during that interval. It appears that the way Spark compu

increase parallelism of reading from hdfs

2014-08-08 Thread Chen Song
In Spark Streaming, StreamingContext.fileStream gives a FileInputDStream. Within each batch interval, it launches map tasks for the new files detected during that interval. It appears that the way Spark computes the number of map tasks is based on the block size of the files. Below is the quote from Spark ...
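For context, a minimal sketch of the pattern Chen describes, assuming a text-based input; the directory path and 30-second batch interval are placeholders.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("file-stream-example")
val ssc = new StreamingContext(conf, Seconds(30))

// fileStream returns a FileInputDStream: each 30-second batch produces one
// RDD over the files newly detected in the monitored directory, and that
// RDD's partition count follows the input splits computed for those files.
val lines = ssc
  .fileStream[LongWritable, Text, TextInputFormat]("hdfs:///path/to/dir")
  .map { case (_, text) => text.toString }

lines.count().print()
ssc.start()
ssc.awaitTermination()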