You could try to force a repartition right at that point by producing a cached version of the DF with .cache(), if memory allows it.
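For illustration, a minimal PySpark sketch of that idea (the file paths, the join column "key", and the DataFrame names are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical inputs: the small lookup file gets broadcast-joined
    # onto the main DataFrame.
    big_df = spark.read.csv("big.csv", header=True)
    small_df = spark.read.csv("small.csv", header=True)

    # Repartition and cache immediately after the broadcast join, so the
    # shuffle materializes here rather than being deferred past the
    # heavy per-record processing.
    joined = (big_df.join(broadcast(small_df), "key")
                    .repartition(100)
                    .cache())
    joined.count()  # an action, to force evaluation at this point

    # Later transformations now start from 100 cached partitions.

The count() is only there to trigger evaluation eagerly; without an action, Spark may still push the repartition down to the next shuffle boundary.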
On Sun, Jul 1, 2018 at 5:04 AM, Abdeali Kothari <abdealikoth...@gmail.com> wrote:
> I've tried that too - it doesn't work. It does a repartition, but not
> right after the broadcast join - it does a lot more processing and does
> the repartition right before I do my next sort-merge join (stage 12 I
> described above).
> As the heavy processing is before the sort-merge join, it still doesn't
> help.
>
> On Sun, Jul 1, 2018, 08:30 yujhe.li <liyu...@gmail.com> wrote:
>
>> Abdeali Kothari wrote
>> > My entire CSV is less than 20 KB.
>> > Somewhere in between, I do a broadcast join with 3500 records in
>> > another file.
>> > After the broadcast join I have a lot of processing to do. Overall,
>> > the time to process a single record goes up to 5 min on 1 executor.
>> >
>> > I'm trying to increase the number of partitions my data is in so that
>> > I have at most 1 record per executor (currently it sets 2 tasks, and
>> > hence 2 executors... I want it to split into at least 100 tasks at a
>> > time so I get 5 records per task => ~20 min per task).
>>
>> Maybe you can try repartition(100) after the broadcast join; the task
>> number should change to 100 for your later transformations.