Hi, I have two very large datasets, both with many repeated keys, that I wish to join.
A simplified example:

dsA:

  A_1 | A_2
  1   | A
  2   | A
  3   | A
  4   | A
  5   | A
  1   | B
  2   | B
  3   | B
  1   | C

dsB:

  B_1 | B_2
  A   | B
  A   | C
  A   | D
  A   | E
  A   | F
  A   | G
  B   | A
  B   | E
  B   | G
  B   | H
  C   | A
  C   | B

The join I want to do is:

  dsA.join(dsB, dsA("A_2") === dsB("B_1"), "inner")

However, this ends up putting a lot of pressure on the tasks that hold the frequently occurring keys - the join is either very, very slow to complete or it runs into memory issues. I've played with grouping both sides by the join key prior to joining, which would make the join itself one-to-one (see the P.S. below for a sketch), but memory becomes an issue again because the groups are very large.

Does anyone have good suggestions on how to make large many-to-many joins complete reliably in Spark? Reliability matters much more to me than speed - this is for a tool, so I can't over-tune to specific data sizes/shapes.

Thanks,
Nathan
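P.S. For reference, a rough sketch of the grouping approach mentioned above (simplified to the example's column names, using Spark's built-in collect_list/explode):

  import org.apache.spark.sql.functions.{col, collect_list, explode}

  // Collapse each side to one row per join key, so the join itself is one-to-one.
  val groupedA = dsA.groupBy("A_2").agg(collect_list("A_1").as("A_1s"))
  val groupedB = dsB.groupBy("B_1").agg(collect_list("B_2").as("B_2s"))

  val joined = groupedA
    .join(groupedB, groupedA("A_2") === groupedB("B_1"), "inner")
    // Re-expand the per-key arrays; for hot keys these arrays are
    // where the memory pressure shows up.
    .withColumn("A_1", explode(col("A_1s")))
    .withColumn("B_2", explode(col("B_2s")))
    .select("A_1", "A_2", "B_2")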