Hi Shreya,

The initial Datasets had more than 1000 partitions, but after a group by operation the
resultant Dataset had only 200 partitions (because the default number of shuffle
partitions is 200). Any further operations on the resultant Dataset therefore run with a
parallelism of at most 200, which makes inefficient use of the cluster.
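For illustration, here is a minimal sketch of what I am seeing and of the workaround
(the path, column name and partition counts below are made up, and the numbers assume
default settings):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("shuffle-partitions-demo").getOrCreate()
import spark.implicits._

// Hypothetical input: a Dataset read with well over 1000 partitions.
val events = spark.read.parquet("/data/events")
println(events.rdd.getNumPartitions)            // e.g. 1000+

// A groupBy triggers a shuffle, so the result has spark.sql.shuffle.partitions
// partitions (200 by default), regardless of how many partitions the input had.
val counts = events.groupBy($"userId").count()
println(counts.rdd.getNumPartitions)            // 200 with default settings

// Workaround 1: change the global shuffle parallelism before the operation
// ("spark.sql.shuffle.partitions" is the same key as SQLConf.SHUFFLE_PARTITIONS.key).
spark.conf.set("spark.sql.shuffle.partitions", "2000")
val counts2 = events.groupBy($"userId").count() // now produces 2000 partitions

// Workaround 2: repartition the result, at the cost of an extra shuffle.
val counts3 = counts.repartition(2000)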
I am performing multiple join & group by operations on Datasets that are huge (8 TB+),
and low parallelism severely affects the time it takes to run the data pipeline. The
workaround of setting sparkSession.conf.set(SQLConf.SHUFFLE_PARTITIONS.key, <num
partitions>) (as in the sketch above) works, but it would be ideal to set the number of
partitions on a per join/group by operation basis, as we could with the RDD API.

Thanks,
Aniket

On Fri, Nov 11, 2016 at 6:27 PM Shreya Agarwal <shrey...@microsoft.com> wrote:

> Curious – why do you want to repartition? Is there a subsequent step which
> fails because the number of partitions is less? Or you want to do it for a
> perf gain?
>
> Also, what were your initial Dataset partitions and how many did you have
> for the result of the join?
>
> *From:* Aniket Bhatnagar [mailto:aniket.bhatna...@gmail.com]
> *Sent:* Friday, November 11, 2016 9:22 AM
> *To:* user <user@spark.apache.org>
> *Subject:* Dataset API | Setting number of partitions during join/groupBy
>
> Hi
>
> I can't seem to find a way to pass the number of partitions while joining 2
> Datasets or doing a groupBy operation on a Dataset. There is an option of
> repartitioning the resultant Dataset, but it's inefficient to repartition
> after the Dataset has been joined/grouped into the default number of
> partitions. With the RDD API, this was easy to do as the functions accepted
> a numPartitions parameter. The only way to do this seems to be
> sparkSession.conf.set(SQLConf.SHUFFLE_PARTITIONS.key, <num partitions>), but
> this means that all join/groupBy operations going forward will have the
> same number of partitions.
>
> Thanks,
> Aniket