Great explanation.

Thanks guys!

Daniel

> On 20 July 2015, at 18:12, Silvio Fiorito <silvio.fior...@granturing.com> 
> wrote:
> 
> Hi Daniel,
> 
> Coalesce, by default, will not cause a shuffle. The second parameter, when set 
> to true, will cause a full shuffle. This is actually what repartition does (it 
> calls coalesce with shuffle=true).
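> 
> For example (just a sketch — rdd and the target of 100 partitions are 
> illustrative, not from this thread):
> 
>   val narrowed = rdd.coalesce(100)                  // merges partitions locally, no shuffle
>   val shuffled = rdd.coalesce(100, shuffle = true)  // forces a full shuffle
>   val same     = rdd.repartition(100)               // equivalent to the line above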
> 
> It will attempt to keep colocated partitions together (as you describe) on 
> the same executor. What may happen is that you lose data locality if you 
> reduce the partitions to fewer than the number of executors. You also reduce 
> parallelism, so you need to be aware of that as you decide when to call 
> coalesce.
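> 
> A rough guard against that (again just a sketch; rdd and the target of 100 
> partitions are illustrative):
> 
>   val numExecutors = sc.getConf.getInt("spark.executor.instances", -1)
>   // keep at least one partition per executor so none of them sit idle
>   val target = if (numExecutors > 0) math.max(100, numExecutors) else 100
>   val coalesced = rdd.coalesce(target)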
> 
> Thanks,
> Silvio
> 
> From: Daniel Haviv
> Date: Monday, July 20, 2015 at 4:59 PM
> To: Doug Balog
> Cc: user
> Subject: Re: Local Repartition
> 
> Thanks Doug,
> coalesce might invoke a shuffle as well. 
> I don't think what I'm suggesting exists as a feature yet, but it definitely 
> should.
> 
> Daniel
> 
>> On Mon, Jul 20, 2015 at 4:15 PM, Doug Balog <d...@balog.net> wrote:
>> Hi Daniel,
>> Take a look at .coalesce()
>> I’ve seen good results by coalescing to num executors * 10, but I’m still 
>> trying to figure out the optimal number of partitions per executor.
>> To get the number of executors:
>>   sc.getConf.getInt("spark.executor.instances", -1)
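>> 
>> Putting that together (a sketch — rdd stands in for whatever RDD you're 
>> coalescing, and the fallback when the setting is missing is illustrative):
>> 
>>   val numExecutors = sc.getConf.getInt("spark.executor.instances", -1)
>>   val coalesced =
>>     if (numExecutors > 0) rdd.coalesce(numExecutors * 10) else rdd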
>> 
>> 
>> Cheers,
>> 
>> Doug
>> 
>> > On Jul 20, 2015, at 5:04 AM, Daniel Haviv 
>> > <daniel.ha...@veracity-group.com> wrote:
>> >
>> > Hi,
>> > My data is constructed from a lot of small files, which results in a lot 
>> > of partitions per RDD.
>> > Is there some way to locally repartition the RDD without shuffling, so 
>> > that all of the partitions that reside on a specific node become X 
>> > partitions on the same node?
>> >
>> > Thank you.
>> > Daniel
> 
