OK, I was comparing groupBy with orderBy, and now I realize that they use
different partitioning schemes.
Thanks, Takeshi.
On Tue, Feb 9, 2016 at 9:09 PM, Takeshi Yamamuro wrote:
> Hi,
>
> DataFrame#sort() uses `RangePartitioning` in `Exchange` instead of
> `HashPartitioning`.
> `RangePartitioning` roughly samples input data and internally computes
> partition bounds to split given rows into `spark.sql.shuffle.partitions`
> partitions.
The issue is still not fully solved even in newer Spark.
On Wed, Feb 10, 2016 at 1:36 AM, Cesar Flores wrote:
> Well, actually I am observing a single partition no matter what my input
> is. I am using Spark 1.3.1.
>
> From what you both are saying, it appears that this sorting issue (going to
> a single partition after applying orderBy on a DF) is solved in later
> versions of Spark?
Well, actually I am observing a single partition no matter what my input
is. I am using Spark 1.3.1.
From what you both are saying, it appears that this sorting issue (going to
a single partition after applying orderBy on a DF) is solved in later
versions of Spark? Well, if that is the case, I guess ...
Hi,
DataFrame#sort() uses `RangePartitioning` in `Exchange` instead of
`HashPartitioning`.
`RangePartitioning` roughly samples input data and internally computes
partition bounds
to split given rows into `spark.sql.shuffle.partitions` partitions.
Therefore, when sort keys are highly skewed, I think ...
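For illustration, a minimal sketch of that behaviour (my own example, not
from the thread; it uses the newer SparkSession API, and the data, column
names, and partition setting are all made up):

    import org.apache.spark.sql.SparkSession

    // Illustrative setup; every name here is an assumption.
    val spark = SparkSession.builder()
      .appName("sort-partitioning-demo")
      .master("local[4]")
      .config("spark.sql.shuffle.partitions", "8")
      .getOrCreate()
    import spark.implicits._

    // Only 3 distinct, heavily repeated sort keys.
    val df = (1 to 100000).map(i => (i % 3, i)).toDF("key", "value")

    // orderBy inserts an Exchange with RangePartitioning: the input is
    // sampled, range bounds are computed, and rows are split into at most
    // spark.sql.shuffle.partitions ranges.
    val sorted = df.orderBy($"key")

    // With skewed keys the sampled bounds collapse onto duplicates, so
    // far fewer than 8 partitions can come out.
    println(sorted.rdd.getNumPartitions)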
For SQL shuffle operations like groupBy, the number of output partitions is
controlled by spark.sql.shuffle.partitions, but it seems orderBy does not
honour this.
In my small test, I could see that the number of partitions in the DF
returned by orderBy was equal to the total number of distinct keys.
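A small test along those lines might look like the following (a sketch under
the same illustrative setup as above; exact counts can vary by Spark
version, e.g. when adaptive query execution coalesces partitions):

    // Skewed input: 3 distinct keys, spark.sql.shuffle.partitions = 8.
    val df = (1 to 100000).map(i => (i % 3, i)).toDF("key", "value")

    // groupBy shuffles with HashPartitioning, so the output normally has
    // spark.sql.shuffle.partitions partitions (8 here), many of them empty.
    println(df.groupBy($"key").count().rdd.getNumPartitions)

    // orderBy shuffles with RangePartitioning; here the observed count is
    // bounded by the number of distinct sort keys (3), matching the
    // observation above.
    println(df.orderBy($"key").rdd.getNumPartitions)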
Hi,
Please use DataFrame#repartition.
On Tue, Feb 9, 2016 at 7:30 AM, Cesar Flores wrote:
>
> I have a data frame which I sort using the orderBy function. This operation
> causes my data frame to go to a single partition. After using those
> results, I would like to re-partition to a larger number of partitions ...
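For example (a sketch, continuing the illustrative df from the sketches
above; the target count of 200 is a placeholder, not from the thread):

    // Repartitioning after the sort triggers a fresh shuffle. Note that
    // repartition(n) redistributes rows without regard to the sort keys,
    // so the global ordering across partitions is lost.
    val widened = df.orderBy($"key").repartition(200)
    println(widened.rdd.getNumPartitions)  // 200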