Hi Alex,

From my understanding, the community is shifting its effort from the RDD-based
APIs to the Dataset/DataFrame-based ones, so I do not think it is necessary to
add a new RDD-based API, as I mentioned before. As for the problem of too many
partitions, there are several other ways to handle it; a sketch of two of them
follows below.
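A minimal sketch of two such alternatives, assuming Spark 1.6-era APIs
(path and numPartitions are placeholders, not values from this thread):

    // 1) Shrink an over-partitioned RDD without a shuffle.
    val lines = sc.textFile(path).coalesce(numPartitions)

    // 2) Read through the Spark SQL file source instead of the RDD API;
    //    as Reynold notes below, split packing happens automatically there.
    val df = sqlContext.read.text(path) // DataFrame with a single "value" column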
Of course, it is just my own thought.

Thanks,
Saisai

On Fri, May 20, 2016 at 1:15 PM, Alexander Pivovarov <apivova...@gmail.com> wrote:

> Saisai, Reynold,
>
> Thank you for your replies.
> I also think that many variations of textFile() might be confusing for
> users. Better to have just one good textFile() implementation.
>
> Do you think sc.textFile() should use CombineTextInputFormat instead of
> TextInputFormat?
>
> CombineTextInputFormat allows users to control the number of partitions
> in the RDD (i.e. the split size).
> It's useful for real workloads (e.g. 100 folders, 200,000 files, all of
> different sizes, e.g. 100KB - 500MB, 4TB total).
>
> The current implementation of sc.textFile() would generate an RDD with
> 250,000+ partitions (one partition per small file, several partitions
> per big file).
>
> Using CombineTextInputFormat allows us to control the number of
> partitions and the split size by setting the
> mapreduce.input.fileinputformat.split.maxsize property. E.g. if we set
> it to 256MB, Spark will generate an RDD with ~20,000 partitions.
>
> It's better to have an RDD with 20,000 partitions of 256MB each than an
> RDD with 250,000+ partitions of sizes ranging from 100KB to 128MB.
>
> So, I see only advantages if sc.textFile() starts using
> CombineTextInputFormat instead of TextInputFormat.
>
> Alex
>
> On Thu, May 19, 2016 at 8:30 PM, Saisai Shao <sai.sai.s...@gmail.com> wrote:
>
>> From my understanding, newAPIHadoopFile or hadoopFile is generic enough
>> to support any InputFormat you want. IMO it is not necessary to add a
>> new API for this.
>>
>> On Fri, May 20, 2016 at 12:59 AM, Alexander Pivovarov <apivova...@gmail.com> wrote:
>>
>>> Spark users might not know about CombineTextInputFormat. They probably
>>> think that sc.textFile already implements the best way to read text
>>> files.
>>>
>>> I think CombineTextInputFormat can replace the regular TextInputFormat
>>> in most cases.
>>> Maybe Spark 2.0 could use CombineTextInputFormat in sc.textFile?
>>>
>>> On May 19, 2016 2:43 AM, "Reynold Xin" <r...@databricks.com> wrote:
>>>
>>>> Users would already be able to do this with the 3 lines of code you
>>>> supplied, right? In general there are a lot of methods already on
>>>> SparkContext, and we lean towards the more conservative side in
>>>> introducing new API variants.
>>>>
>>>> Note that this is something we are doing automatically in Spark SQL
>>>> for file sources (Dataset/DataFrame).
>>>>
>>>> On Sat, May 14, 2016 at 8:13 PM, Alexander Pivovarov <apivova...@gmail.com> wrote:
>>>>
>>>>> Hello Everyone,
>>>>>
>>>>> Do you think it would be useful to add a combinedTextFile method
>>>>> (which uses CombineTextInputFormat) to SparkContext?
>>>>>
>>>>> It allows one task to read data from multiple text files, and lets
>>>>> you control the number of RDD partitions by setting
>>>>> mapreduce.input.fileinputformat.split.maxsize:
>>>>>
>>>>> def combinedTextFile(sc: SparkContext)(path: String): RDD[String] = {
>>>>>   val conf = sc.hadoopConfiguration
>>>>>   sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
>>>>>       classOf[LongWritable], classOf[Text], conf)
>>>>>     .map(pair => pair._2.toString)
>>>>>     .setName(path)
>>>>> }
>>>>>
>>>>> Alex
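For reference, a self-contained sketch of the helper above, with the imports
it needs and the split-size setting applied; the 256MB cap mirrors the example
earlier in the thread and is only illustrative. Note that this mutates
sc.hadoopConfiguration, which is shared across jobs on the same context.

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def combinedTextFile(sc: SparkContext)(path: String): RDD[String] = {
      val conf = sc.hadoopConfiguration
      // Cap each combined split at 256MB (tune per workload).
      conf.set("mapreduce.input.fileinputformat.split.maxsize", (256L << 20).toString)
      sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
          classOf[LongWritable], classOf[Text], conf)
        .map(pair => pair._2.toString)
        .setName(path)
    }

    // Usage (hypothetical path): many small files read as ~256MB splits,
    // so ~20,000 partitions instead of 250,000+ in the workload above.
    // val lines = combinedTextFile(sc)("s3://bucket/logs/*")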