Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Hemant Bhanawat
Ohk. I was comparing groupBy with orderBy and now I realize that they are using different partitioning schemes. Thanks Takeshi. On Tue, Feb 9, 2016 at 9:09 PM, Takeshi Yamamuro wrote: > Hi, > > DataFrame#sort() uses `RangePartitioning` in `Exchange` instead of > `HashPartitioning`. > `RangePa

Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Takeshi Yamamuro
The issue is not almost solved even in newer Spark. On Wed, Feb 10, 2016 at 1:36 AM, Cesar Flores wrote: > Well, actually I am observing a single partition no matter what my input > is. I am using spark 1.3.1. > > For what you both are saying, it appears that this sorting issue (going to > a si

Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Cesar Flores
Well, actually I am observing a single partition no matter what my input is. I am using spark 1.3.1. For what you both are saying, it appears that this sorting issue (going to a single partition after applying orderBy in a DF) is solved in later version of Spark? Well, if that is the case, I guess

Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Takeshi Yamamuro
Hi, DataFrame#sort() uses `RangePartitioning` in `Exchange` instead of `HashPartitioning`. `RangePartitioning` roughly samples input data and internally computes partition bounds to split given rows into `spark.sql.shuffle.partitions` partitions. Therefore, when sort keys are highly skewed, I thin

Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Hemant Bhanawat
For sql shuffle operations like groupby, the number of output partitions is controlled by spark.sql.shuffle.partitions. But, it seems orderBy does not honour this. In my small test, I could see that the number of partitions in DF returned by orderBy was equal to the total number of distinct keys.

Re: Optimal way to re-partition from a single partition

2016-02-08 Thread Takeshi Yamamuro
Hi, Plz use DataFrame#repartition. On Tue, Feb 9, 2016 at 7:30 AM, Cesar Flores wrote: > > I have a data frame which I sort using orderBy function. This operation > causes my data frame to go to a single partition. After using those > results, I would like to re-partition to a larger number of