Hi Shreya,

The initial Datasets had more than 1000 partitions, but after a group by operation the
resultant Dataset had only 200 partitions (because the default number of shuffle
partitions is 200). Any further operations on the resultant Dataset therefore run with a
parallelism of at most 200, which makes inefficient use of the cluster.
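For illustration, here is a minimal sketch of what I am seeing and of the workaround
(the path, column name and partition counts below are made up, and the numbers assume
default settings):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("shuffle-partitions-demo").getOrCreate()
import spark.implicits._

// Hypothetical input: a Dataset read with well over 1000 partitions.
val events = spark.read.parquet("/data/events")
println(events.rdd.getNumPartitions)            // e.g. 1000+

// A groupBy triggers a shuffle, so the result has spark.sql.shuffle.partitions
// partitions (200 by default), regardless of how many partitions the input had.
val counts = events.groupBy($"userId").count()
println(counts.rdd.getNumPartitions)            // 200 with default settings

// Workaround 1: change the global shuffle parallelism before the operation
// ("spark.sql.shuffle.partitions" is the same key as SQLConf.SHUFFLE_PARTITIONS.key).
spark.conf.set("spark.sql.shuffle.partitions", "2000")
val counts2 = events.groupBy($"userId").count() // now produces 2000 partitions

// Workaround 2: repartition the result, at the cost of an extra shuffle.
val counts3 = counts.repartition(2000)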
I am performing multiple join & group by operations on Datasets that are huge (8 TB+),
and low parallelism severely affects the time it takes to run the data pipeline. The
workaround of setting sparkSession.conf.set(SQLConf.SHUFFLE_PARTITIONS.key, <num
partitions>) (as in the sketch above) works, but it would be ideal to set the number of
partitions on a per join/group by operation basis, as we could with the RDD API.

Thanks,
Aniket

On Fri, Nov 11, 2016 at 6:27 PM Shreya Agarwal <shrey...@microsoft.com> wrote:

> Curious – why do you want to repartition? Is there a subsequent step which
> fails because the number of partitions is less? Or you want to do it for a
> perf gain?
>
> Also, what were your initial Dataset partitions and how many did you have
> for the result of the join?
>
> *From:* Aniket Bhatnagar [mailto:aniket.bhatna...@gmail.com]
> *Sent:* Friday, November 11, 2016 9:22 AM
> *To:* user <user@spark.apache.org>
> *Subject:* Dataset API | Setting number of partitions during join/groupBy
>
> Hi
>
> I can't seem to find a way to pass the number of partitions while joining 2
> Datasets or doing a groupBy operation on a Dataset. There is an option of
> repartitioning the resultant Dataset, but it's inefficient to repartition
> after the Dataset has been joined/grouped into the default number of
> partitions. With the RDD API, this was easy to do as the functions accepted
> a numPartitions parameter. The only way to do this seems to be
> sparkSession.conf.set(SQLConf.SHUFFLE_PARTITIONS.key, <num partitions>), but
> this means that all join/groupBy operations going forward will have the
> same number of partitions.
>
> Thanks,
> Aniket