Hello Everyone
Do you think it would be useful to add a combinedTextFile method (backed by
CombineTextInputFormat) to SparkContext?
It lets a single task read data from multiple text files, and it lets users
control the number of RDD partitions by setting
mapreduce.input.fileinputformat.split.maxsize.
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Pack many small files into fewer splits via CombineTextInputFormat.
def combinedTextFile(sc: SparkContext)(path: String): RDD[String] = {
  val conf = sc.hadoopConfiguration
  sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
    classOf[LongWritable], classOf[Text], conf)
    .map(pair => pair._2.toString).setName(path)
}
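
As a rough sketch of how it would be used (the path and the 128 MB split
size below are just placeholders, not part of the proposal):

// Hypothetical usage: cap each combined split at 128 MB, so the number of
// partitions is roughly total input size / 128 MB regardless of file count.
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.split.maxsize",
  (128 * 1024 * 1024).toString)
val lines = combinedTextFile(sc)("hdfs:///data/small-files/*.txt")
println(lines.getNumPartitions)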
Alex