Hi Alex,
From my understanding, the community is shifting its effort from the RDD-based
APIs to the Dataset/DataFrame-based ones, so in my opinion it is not really
necessary to add a new RDD-based API, as I mentioned before. As for the problem
of too many partitions, I think there are several other ways to handle it.
Saisai, Reynold,
Thank you for your replies.
I also think that having many variations of textFile() might be confusing
for users. It is better to have just one good textFile() implementation.
Do you think sc.textFile() should use CombineTextInputFormat instead
of TextInputFormat?
From my understanding, newAPIHadoopFile or hadoopFile is generic enough for
you to support any InputFormat you want. IMO it is not really necessary to
add a new API for this.
On Fri, May 20, 2016 at 12:59 AM, Alexander Pivovarov wrote:
> Spark users might not know about CombineTextInputFormat. They probably
Spark users might not know about CombineTextInputFormat. They probably
think that sc.textFile already implements the best way to read text files.
I think CombineTextInputFormat can replace the regular TextInputFormat in most
cases.
Maybe Spark 2.0 could use CombineTextInputFormat in sc.textFile?
Not exactly the same as the one you suggested, but you can chain it with
flatMap to get what you want, if each file is not huge.
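For example (a sketch; the bucket path is made up):

// wholeTextFiles yields one (path, content) pair per file; flatMap then
// splits each file's content into individual lines.
val lines = sc.wholeTextFiles("s3://bucket/small-files")
  .flatMap { case (_, content) => content.split("\n") }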
On Thu, May 19, 2016, 8:41 AM Xiangrui Meng wrote:
> This was implemented as sc.wholeTextFiles.
>
> On Thu, May 19, 2016, 2:43 AM Reynold Xin wrote:
>
>> Users would be able to run this already with the 3 lines of code you
>> supplied, right?
It is different, isn't it? wholeTextFiles returns one element per file,
whereas the combined input format is similar to coalescing partitions to
bin-pack files into splits of a certain size.
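To make the contrast concrete (a sketch; the 64 MB cap is arbitrary):

// wholeTextFiles: one element per file, the whole content as a single string.
val perFile = sc.wholeTextFiles("/data/in")  // RDD[(path, content)]

// CombineTextInputFormat instead bin-packs many small files into splits of
// at most split.maxsize bytes, while still yielding one element per line:
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024)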
On Thursday, May 19, 2016, Xiangrui Meng wrote:
> This was implemented as sc.wholeTextFiles.
>
> On Thu, May 19, 2016, 2:43 AM Reynold Xin wrote:
This was implemented as sc.wholeTextFiles.
On Thu, May 19, 2016, 2:43 AM Reynold Xin wrote:
> Users would be able to run this already with the 3 lines of code you
> supplied, right? In general there are a lot of methods already on
> SparkContext and we lean towards the more conservative side in introducing
> new API variants.
Users would be able to run this already with the 3 lines of code you
supplied, right? In general there are a lot of methods already on
SparkContext, and we lean towards the more conservative side in introducing
new API variants.
Note that this is something we are doing automatically in Spark SQL for file
sources.
Hello everyone,
Do you think it would be useful to add a combinedTextFile method (which uses
CombineTextInputFormat) to SparkContext?
It allows one task to read data from multiple text files, and it lets you
control the number of RDD partitions by setting
mapreduce.input.fileinputformat.split.maxsize.
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def combinedTextFile(sc: SparkContext)(path: String): RDD[String] = {
  // Combine many small files into fewer splits, then keep only the line text.
  sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
      classOf[LongWritable], classOf[Text], sc.hadoopConfiguration)
    .map(pair => pair._2.toString).setName(path)
}
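Example usage (hypothetical path; the 256 MB cap is just an illustration):

// Cap combined splits at 256 MB so the partition count tracks total input
// size rather than the number of files.
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024)

val lines = combinedTextFile(sc)("s3://bucket/many-small-files")
println(lines.getNumPartitions)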