Hey Reynold,

Created an issue (and a PR) for this change to get the discussion started.
Thanks,
Nezih

On Fri, Feb 26, 2016 at 12:03 AM Reynold Xin <r...@databricks.com> wrote:

> Using the right email for Nezih
>
> On Fri, Feb 26, 2016 at 12:01 AM, Reynold Xin <r...@databricks.com> wrote:
>
>> I think this can be useful.
>>
>> The only thing is that we are slowly migrating to the Dataset/DataFrame
>> API, leaving RDD mostly as-is as a lower-level API. Maybe we should do
>> both? In either case it would be great to discuss the API on a pull
>> request. Cheers.
>>
>> On Wed, Feb 24, 2016 at 2:08 PM, Nezih Yigitbasi
>> <nyigitb...@netflix.com.invalid> wrote:
>>
>>> Hi Spark devs,
>>>
>>> I sent an email about my problem some time ago: I want to merge a
>>> large number of small files with Spark. Currently I am using Hive with
>>> the CombineHiveInputFormat, and I can control the size of the output
>>> files with the max split size parameter (which the
>>> CombineHiveInputFormat uses to coalesce the input splits). My first
>>> attempt was to use coalesce(), but since coalesce only considers the
>>> target number of partitions, the output file sizes varied wildly.
>>>
>>> What I think could be useful is an optional PartitionCoalescer
>>> parameter (a new interface) in the coalesce() method (or maybe a new
>>> method?) that callers can implement for custom coalescing strategies.
>>> For my use case I have already implemented a SizeBasedPartitionCoalescer
>>> that coalesces partitions by looking at their sizes and using a max
>>> split size parameter, similar to the CombineHiveInputFormat (I also had
>>> to expose HadoopRDD to get access to the individual split sizes etc.).
>>>
>>> What do you think about such a change? Could it be useful to other
>>> users as well, or is there an easier way to accomplish the same merge
>>> logic? If you think it may be useful, I already have an implementation
>>> and will be happy to work with the community to contribute it.
>>>
>>> Thanks,
>>> Nezih
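
For reference, here is a minimal, self-contained sketch of the size-based
grouping idea described above. The trait and class names follow the email,
but the signatures, the byte-size inputs, and the Demo driver are
illustrative assumptions, not Spark's actual API.

// Hypothetical sketch: PartitionCoalescer and SizeBasedPartitionCoalescer
// are the names proposed in the thread; the shapes below are assumptions.
trait PartitionCoalescer {
  // Given the byte size of each parent partition, return groups of
  // parent-partition indices; each group becomes one output partition.
  def coalesce(partitionSizes: Seq[Long]): Seq[Seq[Int]]
}

class SizeBasedPartitionCoalescer(maxSplitSize: Long) extends PartitionCoalescer {
  override def coalesce(partitionSizes: Seq[Long]): Seq[Seq[Int]] = {
    val groups = scala.collection.mutable.ArrayBuffer[Seq[Int]]()
    var current = scala.collection.mutable.ArrayBuffer[Int]()
    var currentSize = 0L
    for ((size, idx) <- partitionSizes.zipWithIndex) {
      // Close the current group once adding this partition would exceed
      // maxSplitSize; a group always keeps at least one partition.
      if (current.nonEmpty && currentSize + size > maxSplitSize) {
        groups += current.toSeq
        current = scala.collection.mutable.ArrayBuffer[Int]()
        currentSize = 0L
      }
      current += idx
      currentSize += size
    }
    if (current.nonEmpty) groups += current.toSeq
    groups.toSeq
  }
}

object Demo extends App {
  // Four small partitions and one large one, with a 128 MB max split size.
  val mb = 1024L * 1024
  val sizes = Seq(40L, 50L, 60L, 10L, 200L).map(_ * mb)
  val coalescer = new SizeBasedPartitionCoalescer(128L * mb)
  // Groups produced: (0, 1), (2, 3), (4)
  println(coalescer.coalesce(sizes))
}

In an actual integration, the size-based grouping would run inside coalesce()
so that each group of small parent partitions is read back as a single
output partition (and hence written as a single, roughly max-split-sized file).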