It is different, isn't it? wholeTextFiles returns one element per file, whereas the combined input format is more like coalescing partitions to bin-pack files into splits of a certain size.
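Roughly, the contrast looks like this (a minimal sketch built around the snippet quoted below; the 128 MB maxsize value is just an illustrative assumption, not a recommended default):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// wholeTextFiles: one (path, content) record per file, so 10,000 small
// files still yield 10,000 elements keyed by file name.
// val perFile: RDD[(String, String)] = sc.wholeTextFiles("hdfs:///logs/*")

// CombineTextInputFormat: lines from many small files are bin-packed into
// splits capped by mapreduce.input.fileinputformat.split.maxsize, so the
// partition count tracks total bytes rather than file count.
def combinedTextFile(sc: SparkContext)(path: String): RDD[String] = {
  val conf = sc.hadoopConfiguration
  // assumed example value: cap each split (one task's input) at 128 MB
  conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024)
  sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
      classOf[LongWritable], classOf[Text], conf)
    .map(pair => pair._2.toString)
    .setName(path)
}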
On Thursday, May 19, 2016, Xiangrui Meng <men...@gmail.com> wrote:

> This was implemented as sc.wholeTextFiles.
>
> On Thu, May 19, 2016, 2:43 AM Reynold Xin <r...@databricks.com> wrote:
>
>> Users would be able to run this already with the 3 lines of code you
>> supplied right? In general there are a lot of methods already on
>> SparkContext and we lean towards the more conservative side in introducing
>> new API variants.
>>
>> Note that this is something we are doing automatically in Spark SQL for
>> file sources (Dataset/DataFrame).
>>
>> On Sat, May 14, 2016 at 8:13 PM, Alexander Pivovarov <apivova...@gmail.com> wrote:
>>
>>> Hello Everyone
>>>
>>> Do you think it would be useful to add combinedTextFile method (which
>>> uses CombineTextInputFormat) to SparkContext?
>>>
>>> It allows one task to read data from multiple text files and control
>>> number of RDD partitions by setting
>>> mapreduce.input.fileinputformat.split.maxsize
>>>
>>> def combinedTextFile(sc: SparkContext)(path: String): RDD[String] = {
>>>   val conf = sc.hadoopConfiguration
>>>   sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
>>>     classOf[LongWritable], classOf[Text], conf).
>>>     map(pair => pair._2.toString).setName(path)
>>> }
>>>
>>> Alex