I prefer not to do a .cache() due to memory limits, but I did try a persist() with DISK_ONLY.

I did the repartition(), followed by a .count(), followed by a persist() with DISK_ONLY. That didn't change the number of tasks either.

On Sun, Jul 1, 2018, 15:50 Alexander Czech <alexander.cz...@googlemail.com> wrote:

> You could try to force a repartition right at that point by producing a
> cached version of the DF with .cache(), if memory allows it.
>
> On Sun, Jul 1, 2018 at 5:04 AM, Abdeali Kothari <abdealikoth...@gmail.com> wrote:
>
>> I've tried that too - it doesn't work. It does a repartition, but not
>> right after the broadcast join - it does a lot more processing first and
>> only does the repartition right before my next sort-merge join (stage 12
>> I described above).
>> As the heavy processing comes before the sort-merge join, it still
>> doesn't help.
>>
>> On Sun, Jul 1, 2018, 08:30 yujhe.li <liyu...@gmail.com> wrote:
>>
>>> Abdeali Kothari wrote
>>> > My entire CSV is less than 20KB.
>>> > Somewhere in between, I do a broadcast join with 3500 records in
>>> > another file.
>>> > After the broadcast join I have a lot of processing to do. Overall,
>>> > the time to process a single record goes up to 5 minutes on 1
>>> > executor.
>>> >
>>> > I'm trying to increase the number of partitions my data is in so
>>> > that I have at most 1 record per executor (currently it sets 2
>>> > tasks, and hence 2 executors... I want it to split into at least
>>> > 100 tasks at a time so I get 5 records per task => ~20 min per
>>> > task).
>>>
>>> Maybe you can try repartition(100) after the broadcast join; the task
>>> number should change to 100 for your later transformations.
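For readers following the thread, here is a minimal sketch of the
repartition-after-broadcast-join sequence being discussed, assuming the
PySpark DataFrame API. The SparkSession setup, file paths, DataFrame
names, and the join key "id" are all hypothetical stand-ins, not the
poster's actual code.

# Minimal sketch, assuming PySpark; names, paths, and the join key
# "id" are hypothetical stand-ins.
from pyspark import StorageLevel
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("repartition-after-broadcast").getOrCreate()

main_df = spark.read.csv("records.csv", header=True)    # the main data (entire CSV < 20KB)
lookup_df = spark.read.csv("lookup.csv", header=True)   # the ~3500-record file to broadcast

# Broadcast join: the lookup side is shipped whole to every executor,
# so the join itself adds no shuffle and leaves main_df's partitioning
# (here, 2 partitions) unchanged.
joined = main_df.join(broadcast(lookup_df), on="id")

# Ask for 100 partitions so the expensive per-record work that follows
# can run as ~100 tasks instead of inheriting the input's 2.
repartitioned = joined.repartition(100)

# Materialize on disk (not memory) and force evaluation with an action,
# as attempted in the thread, to try to pin the exchange at this point
# in the plan rather than letting it slide to a later stage.
repartitioned.persist(StorageLevel.DISK_ONLY)
repartitioned.count()

# ...heavy per-record processing on `repartitioned` follows here...

Note that the original poster reports above that even this
repartition / count / persist(DISK_ONLY) sequence did not change the
task count in their case, so the sketch illustrates the suggestion
under discussion rather than a confirmed fix.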