Hey Reynold,
Created an issue (and a PR) for this change to get the discussion started.

Thanks,
Nezih

On Fri, Feb 26, 2016 at 12:03 AM Reynold Xin <r...@databricks.com> wrote:

> Using the right email for Nezih
>
>
> On Fri, Feb 26, 2016 at 12:01 AM, Reynold Xin <r...@databricks.com> wrote:
>
>> I think this can be useful.
>>
>> The only thing is that we are slowly migrating to the Dataset/DataFrame
>> API, leaving the RDD API mostly as is as a lower-level API. Maybe we
>> should do both? In either case it would be great to discuss the API on
>> a pull request. Cheers.
>>
>> On Wed, Feb 24, 2016 at 2:08 PM, Nezih Yigitbasi <
>> nyigitb...@netflix.com.invalid> wrote:
>>
>>> Hi Spark devs,
>>>
>>> I sent an email some time ago about a problem where I want to merge a
>>> large number of small files with Spark. Currently I am using Hive with
>>> CombineHiveInputFormat, and I can control the size of the output files
>>> with the max split size parameter (which CombineHiveInputFormat uses to
>>> coalesce the input splits). My first attempt in Spark was to use
>>> coalesce(), but since coalesce() only considers the target number of
>>> partitions, the output file sizes varied wildly.
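>>>
>>> For example, the best I can do today is something like the following,
>>> which fixes only the partition count and says nothing about sizes:
>>>
>>>   // yields exactly 100 partitions, but their byte sizes depend
>>>   // entirely on how the input data happens to be laid out
>>>   rdd.coalesce(100)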
>>>
>>> What I think could be useful is an optional PartitionCoalescer
>>> parameter (a new interface) in the coalesce() method (or maybe a new
>>> method?) that callers can implement for custom coalescing strategies.
>>> For my use case I have already implemented a
>>> SizeBasedPartitionCoalescer that coalesces partitions by looking at
>>> their sizes and using a max split size parameter, similar to
>>> CombineHiveInputFormat (I also had to expose HadoopRDD to get access
>>> to the individual split sizes etc.).
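>>>
>>> To make the proposal concrete, here is a rough sketch of what the
>>> interface and the extended coalesce() signature could look like (all
>>> names and signatures below are placeholders, open to discussion):
>>>
>>>   trait PartitionCoalescer {
>>>     // Group the parent RDD's partitions into at most maxPartitions
>>>     // groups; each PartitionGroup becomes one output partition.
>>>     def coalesce(maxPartitions: Int, parent: RDD[_]): Array[PartitionGroup]
>>>   }
>>>
>>>   // coalesce() would accept an optional custom coalescer and fall
>>>   // back to the current behavior when none is given:
>>>   def coalesce(numPartitions: Int, shuffle: Boolean = false,
>>>                partitionCoalescer: Option[PartitionCoalescer] = None): RDD[T]
>>>
>>> A size-based strategy would then plug in like this (the constructor
>>> parameter here is hypothetical):
>>>
>>>   // target output partitions of roughly 1 GB each
>>>   rdd.coalesce(numPartitions, shuffle = false,
>>>     Some(new SizeBasedPartitionCoalescer(maxSplitSize = 1L << 30)))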
>>>
>>> What do you think about such a change? Could it be useful to other
>>> users as well, or is there an easier way to accomplish the same merge
>>> logic? If it sounds useful, I already have an implementation and would
>>> be happy to work with the community to contribute it.
>>>
>>> Thanks,
>>> Nezih
>>>
>>
>>
>
