Great explanation. Thanks guys!
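To make the distinction concrete for the archive, here's a minimal Scala sketch of what's described below. The input path is hypothetical, and executors * 10 is just Doug's rule of thumb, not a guaranteed optimum:

import org.apache.spark.{SparkConf, SparkContext}

object CoalesceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("coalesce-sketch"))

    // Hypothetical input: a directory of many small files, which yields
    // roughly one partition per file and hence a very high partition count.
    val rdd = sc.textFile("/data/small-files/*")
    println(s"initial partitions: ${rdd.partitions.length}")

    // Doug's heuristic: num executors * 10. getInt returns the default (-1)
    // when the setting is absent (e.g. dynamic allocation), so guard it.
    val executors = sc.getConf.getInt("spark.executor.instances", -1)
    val target = if (executors > 0) executors * 10 else rdd.partitions.length

    // No shuffle: Spark builds narrow dependencies and tries to merge
    // co-located partitions on the same executor (locality is best-effort).
    val merged = rdd.coalesce(target)

    // Full shuffle: identical to rdd.repartition(target), which just calls
    // coalesce(target, shuffle = true) internally.
    val reshuffled = rdd.coalesce(target, shuffle = true)

    println(s"merged: ${merged.partitions.length}, reshuffled: ${reshuffled.partitions.length}")
    sc.stop()
  }
}

Note that the shuffle-less coalesce only attempts to merge co-located partitions; as Silvio points out, it doesn't strictly guarantee node-local grouping.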
Daniel

> On 20 July 2015, at 18:12, Silvio Fiorito <silvio.fior...@granturing.com> wrote:
>
> Hi Daniel,
>
> Coalesce, by default, will not cause a shuffle. The second parameter, when set to true, will cause a full shuffle. This is actually what repartition does (it calls coalesce with shuffle=true).
>
> It will attempt to keep colocated partitions together (as you describe) on the same executor. What may happen is that you lose data locality if you reduce the partitions to fewer than the number of executors. You obviously also reduce parallelism, so you need to be aware of that as you decide when to call coalesce.
>
> Thanks,
> Silvio
>
> From: Daniel Haviv
> Date: Monday, July 20, 2015 at 4:59 PM
> To: Doug Balog
> Cc: user
> Subject: Re: Local Repartition
>
> Thanks Doug,
> coalesce might invoke a shuffle as well.
> I don't think what I'm suggesting is a feature, but it definitely should be.
>
> Daniel
>
>> On Mon, Jul 20, 2015 at 4:15 PM, Doug Balog <d...@balog.net> wrote:
>> Hi Daniel,
>> Take a look at .coalesce()
>> I've seen good results by coalescing to num executors * 10, but I'm still trying to figure out the optimal number of partitions per executor.
>> To get the number of executors:
>> sc.getConf.getInt("spark.executor.instances", -1)
>>
>> Cheers,
>>
>> Doug
>>
>>> On Jul 20, 2015, at 5:04 AM, Daniel Haviv <daniel.ha...@veracity-group.com> wrote:
>>>
>>> Hi,
>>> My data is constructed from a lot of small files, which results in a lot of partitions per RDD.
>>> Is there some way to locally repartition the RDD without shuffling, so that all of the partitions that reside on a specific node become X partitions on the same node?
>>>
>>> Thank you.
>>> Daniel