Hi,

DataFrame#sort() uses `RangePartitioning` in its `Exchange` instead of
`HashPartitioning`.
`RangePartitioning` roughly samples the input data and internally computes
partition bounds
to split the rows into `spark.sql.shuffle.partitions` partitions.
Therefore, when the sort keys are highly skewed, I think some partitions
could end up empty
(that is, the number of non-empty result partitions is lower than
`spark.sql.shuffle.partitions`).
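As a rough illustration (not from the original thread; `sqlContext` and the
toy skewed column below are just placeholders), you can check the physical
plan for the range-partitioned Exchange and count how many output partitions
actually hold rows:

import org.apache.spark.sql.functions._

// Heavily skewed sort key: almost every row shares the same value.
val df = sqlContext.range(0L, 1000000L)
  .withColumn("key", when(col("id") % 100 === 0, col("id")).otherwise(lit(0L)))

val sorted = df.orderBy("key")

// The plan should show a range-partitioned Exchange, not a hash-partitioned one.
sorted.explain()

// Rows per partition; with a skewed key, many of the
// spark.sql.shuffle.partitions partitions can come out empty.
val sizes = sorted.rdd.mapPartitions(it => Iterator(it.size)).collect()
println(s"non-empty partitions: ${sizes.count(_ > 0)} of ${sizes.length}")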


On Tue, Feb 9, 2016 at 9:35 PM, Hemant Bhanawat <hemant9...@gmail.com>
wrote:

> For SQL shuffle operations like groupBy, the number of output partitions
> is controlled by spark.sql.shuffle.partitions. But it seems orderBy does
> not honour this.
>
> In my small test, I could see that the number of partitions in the DF
> returned by orderBy was equal to the total number of distinct keys. Are you
> observing the same thing, i.e. do you have a single value for all rows in
> the column on which you are running orderBy? If so, you are better off not
> running the orderBy clause.
>
> Maybe someone from the Spark SQL team could answer how the partitioning of
> the output DF should be handled when doing an orderBy?
>
> Hemant
> www.snappydata.io
> https://github.com/SnappyDataInc/snappydata
>
>
>
>
> On Tue, Feb 9, 2016 at 4:00 AM, Cesar Flores <ces...@gmail.com> wrote:
>
>>
>> I have a data frame which I sort using the orderBy function. This operation
>> causes my data frame to go to a single partition. After using those
>> results, I would like to re-partition it to a larger number of partitions.
>> Currently I am just doing:
>>
>> // df is a dataframe with a single partition and around 14 million records
>> val rdd = df.rdd.coalesce(100, true)
>> val newDF = hc.createDataFrame(rdd, df.schema)
>>
>> This process is really slow. Is there any other way of achieving this
>> task, or of optimizing it (perhaps by tweaking a Spark configuration
>> parameter)?
>>
>>
>> Thanks a lot
>> --
>> Cesar Flores
>>
>
>


-- 
---
Takeshi Yamamuro
