It is different, isn't it? wholeTextFiles returns one element per file, whereas the combined input format is more like coalescing partitions to bin-pack files into splits of a certain size.
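Roughly, the contrast looks like this (a minimal sketch built around the snippet quoted below; the 128 MB maxsize value is just an illustrative assumption, not a recommended default):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// wholeTextFiles: one (path, content) record per file, so 10,000 small
// files still yield 10,000 elements keyed by file name.
// val perFile: RDD[(String, String)] = sc.wholeTextFiles("hdfs:///logs/*")

// CombineTextInputFormat: lines from many small files are bin-packed into
// splits capped by mapreduce.input.fileinputformat.split.maxsize, so the
// partition count tracks total bytes rather than file count.
def combinedTextFile(sc: SparkContext)(path: String): RDD[String] = {
  val conf = sc.hadoopConfiguration
  // assumed example value: cap each split (one task's input) at 128 MB
  conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024)
  sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
      classOf[LongWritable], classOf[Text], conf)
    .map(pair => pair._2.toString)
    .setName(path)
}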
On Thursday, May 19, 2016, Xiangrui Meng <men...@gmail.com> wrote:

> This was implemented as sc.wholeTextFiles.
>
> On Thu, May 19, 2016, 2:43 AM Reynold Xin <r...@databricks.com> wrote:
>
>> Users would be able to run this already with the 3 lines of code you
>> supplied right? In general there are a lot of methods already on
>> SparkContext and we lean towards the more conservative side in introducing
>> new API variants.
>>
>> Note that this is something we are doing automatically in Spark SQL for
>> file sources (Dataset/DataFrame).
>>
>> On Sat, May 14, 2016 at 8:13 PM, Alexander Pivovarov <apivova...@gmail.com> wrote:
>>
>>> Hello Everyone
>>>
>>> Do you think it would be useful to add combinedTextFile method (which
>>> uses CombineTextInputFormat) to SparkContext?
>>>
>>> It allows one task to read data from multiple text files and control
>>> number of RDD partitions by setting
>>> mapreduce.input.fileinputformat.split.maxsize
>>>
>>> def combinedTextFile(sc: SparkContext)(path: String): RDD[String] = {
>>>   val conf = sc.hadoopConfiguration
>>>   sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
>>>     classOf[LongWritable], classOf[Text], conf).
>>>     map(pair => pair._2.toString).setName(path)
>>> }
>>>
>>> Alex