Thanks Doug, but coalesce might invoke a shuffle as well. I don't think what I'm suggesting exists as a feature yet, but it definitely should.
Daniel

On Mon, Jul 20, 2015 at 4:15 PM, Doug Balog <d...@balog.net> wrote:

> Hi Daniel,
> Take a look at .coalesce()
> I've seen good results by coalescing to num executors * 10, but I'm still
> trying to figure out the optimal number of partitions per executor.
> To get the number of executors:
> sc.getConf.getInt("spark.executor.instances", -1)
>
> Cheers,
>
> Doug
>
> > On Jul 20, 2015, at 5:04 AM, Daniel Haviv <daniel.ha...@veracity-group.com> wrote:
> >
> > Hi,
> > My data is constructed from a lot of small files, which results in a lot of
> > partitions per RDD.
> > Is there some way to locally repartition the RDD without shuffling, so that
> > all of the partitions that reside on a specific node become X partitions on
> > the same node?
> >
> > Thank you.
> > Daniel
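For anyone following the thread, here is a minimal Scala sketch of the distinction being discussed. It assumes a running SparkContext `sc`; the input path and variable names are illustrative, not from the thread. `coalesce` defaults to `shuffle = false`, which merges existing partitions through a narrow dependency (no network shuffle), while `repartition(n)` is equivalent to `coalesce(n, shuffle = true)` and always shuffles:

```scala
// Sketch only: assumes a live SparkContext `sc`; the path is hypothetical.
val smallFilesRDD = sc.textFile("hdfs:///data/many-small-files/*")

// Doug's suggestion: read the executor count from the config (-1 if unset).
val numExecutors = sc.getConf.getInt("spark.executor.instances", -1)

// shuffle = false (the default): existing partitions are merged without a
// network shuffle, preferring co-located partitions -- the closest built-in
// behavior to "locally repartition without shuffling".
val merged = smallFilesRDD.coalesce(numExecutors * 10)

// shuffle = true is what repartition(n) does under the hood; this is the
// case where coalesce *does* invoke a shuffle.
val reshuffled = smallFilesRDD.coalesce(numExecutors * 10, shuffle = true)
```

Note that with `shuffle = false` the merged partitions are not guaranteed to stay strictly node-local, which is why the original question is not fully answered by `coalesce` alone.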