Hi,

I have two very large datasets, both with many repeated keys, that I
wish to join.

A simplified example:

dsA

A_1 |A_2
1 |A
2 |A
3 |A
4 |A
5 |A
1 |B
2 |B
3 |B
1 |C

dsB

B_1 |B_2
A |B
A |C
A |D
A |E
A |F
A |G
B |A
B |E
B |G
B |H
C |A
C |B


The join I want to do is:

dsA.join(dsB, dsA("A_2") === dsB("B_1"), "inner")

However, this ends up putting a lot of pressure on the tasks that handle
the frequently occurring keys - the join is either very, very slow to
complete or I run into memory issues.
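
(For reference, the skew is easy to see with something like the
following - a quick diagnostic rather than part of the tool itself,
using the column names from the simplified example and the usual
SparkSession:)

import spark.implicits._

// Count rows per join key on each side; the hot keys dominate.
dsA.groupBy("A_2").count().orderBy($"count".desc).show()
dsB.groupBy("B_1").count().orderBy($"count".desc).show()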

I've experimented with grouping both sides by the join key before
joining (which makes the join itself one-to-one), but memory becomes an
issue again because the groups are very large. A sketch of what I mean
follows.
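
(Simplified to the example columns, and assuming the usual imports; the
real code works on the full row types:)

import org.apache.spark.sql.functions.{collect_list, explode}
import spark.implicits._

// One row per key on each side, so the join itself is one-to-one.
val groupedA = dsA.groupBy("A_2").agg(collect_list("A_1").as("a1s"))
val groupedB = dsB.groupBy("B_1").agg(collect_list("B_2").as("b2s"))

val joined = groupedA.join(groupedB, groupedA("A_2") === groupedB("B_1"), "inner")

// Re-expanding the lists is where the very large groups hurt:
// a single hot key carries both complete lists in one row.
val result = joined
  .select($"A_2", explode($"a1s").as("A_1"), $"b2s")
  .select($"A_1", $"A_2", explode($"b2s").as("B_2"))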

Does anyone have good suggestions for making large many-to-many joins
complete reliably in Spark?

Reliability matters much more to me than speed - this is for a
general-purpose tool, so I can't over-tune for specific data
sizes/shapes.

Thanks,

Nathan
