If your data has special characteristics, for example one side of the join is small and the other large, you can do a map-side join in Spark by broadcasting the small side (broadcast variables); this will speed things up considerably.
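To make the idea concrete, here is a minimal sketch of a map-side (broadcast) join in plain Python rather than Spark itself: the small side is materialized as an in-memory dict (the role a broadcast variable plays on each executor), and the large side is streamed through it with no shuffle. All names here are illustrative, not Spark API.

```python
def broadcast_join(large_rows, small_rows):
    """Inner-join (key, value) rows of the large side against the small side."""
    # "Broadcast" step: build a lookup table from the small side once.
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)
    # Map side: each large-side row is joined locally, no shuffle needed.
    for key, value in large_rows:
        for other in lookup.get(key, []):
            yield (key, value, other)

large = [("a", 1), ("b", 2), ("a", 3)]
small = [("a", "x"), ("c", "y")]
print(list(broadcast_join(large, small)))
# [('a', 1, 'x'), ('a', 3, 'x')]
```

In real Spark you would get the same effect with a broadcast variable around the collected small dataset (or, with DataFrames, a broadcast join hint), avoiding the shuffle that a regular join triggers.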
Otherwise, as Pitel mentioned, if there is nothing special about the data and it really is a Cartesian product per key, the join might just take forever.
Maybe I'm wrong, but what you are doing here is basically a Cartesian product for each key. So if "hello" appears 100 times in your corpus, the join will produce 100*100 = 10,000 elements for that key alone.
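You can see the blow-up with a quick back-of-the-envelope calculation: an inner join emits, for each key, the product of its row counts on the two sides. A small sketch (hypothetical helper name):

```python
def join_output_size(left_counts, right_counts):
    """Row count of an inner join, given per-key row counts of each side."""
    return sum(n * right_counts.get(key, 0) for key, n in left_counts.items())

# "hello" appears 100 times on each side: 100 * 100 = 10,000 output rows.
print(join_output_size({"hello": 100}, {"hello": 100}))
# 10000
```

With many frequent words, these per-key products add up quadratically, which is exactly why the job appears to hang.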
I don't fully understand what you're trying to do here, but with that kind of blow-up it's no surprise your join takes forever.