This is the classic skew-join issue. Two options: either add salting to your
join, or filter out the frequently occurring keys, join them separately, and
union the results back.
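
A minimal sketch of the salting option, assuming the datasets and column names
from your example (dsA/A_2, dsB/B_1) and a hypothetical salt factor of 16 that
you would tune to your skew:

    import org.apache.spark.sql.functions._

    // Hypothetical salt factor; tune it to the observed skew.
    val saltBuckets = 16

    // Add a random salt to the left side so each heavy key is spread
    // across several partitions instead of landing in one task.
    val saltedA = dsA.withColumn("salt", (rand() * saltBuckets).cast("int"))

    // Replicate the right side once per salt value so every salted
    // left row still finds all of its matches.
    val saltedB = dsB.withColumn("salt",
      explode(array((0 until saltBuckets).map(lit): _*)))

    val joined = saltedA
      .join(saltedB,
        saltedA("A_2") === saltedB("B_1") && saltedA("salt") === saltedB("salt"),
        "inner")
      .drop("salt")

And a sketch of the second option, assuming a hypothetical frequency threshold
to decide which keys count as skewed:

    import org.apache.spark.sql.functions._

    // Hypothetical threshold: keys seen more often than this are treated as skewed.
    val threshold = 100000L

    // Distinct heavy keys on the left side; this set is usually tiny.
    val hotKeys = dsA.groupBy("A_2").count()
      .filter(col("count") > threshold)
      .select("A_2")

    // Split dsA into its skewed and non-skewed parts.
    val hotA  = dsA.join(broadcast(hotKeys), Seq("A_2"), "left_semi")
    val coldA = dsA.join(broadcast(hotKeys), Seq("A_2"), "left_anti")

    // Join each part (the hot part can use a different strategy, e.g. the
    // salted join above), then union the results back together.
    val joined =
      hotA.join(dsB, hotA("A_2") === dsB("B_1"), "inner")
        .union(coldA.join(dsB, coldA("A_2") === dsB("B_1"), "inner"))

Both are sketches, not drop-in code - the salt factor and threshold are made-up
numbers you would have to size from your own key distribution.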

On Fri, Mar 20, 2020 at 4:12 AM nathan grand <nathanwilliamgr...@gmail.com>
wrote:

> Hi,
>
> I have two very large datasets, which both have many repeated keys, which I
> wish to join.
>
> A simplified example:
>
> dsA
>
> A_1 |A_2
> 1 |A
> 2 |A
> 3 |A
> 4 |A
> 5 |A
> 1 |B
> 2 |B
> 3 |B
> 1 |C
>
> dsB
>
> B_1 |B_2
> A |B
> A |C
> A |D
> A |E
> A |F
> A |G
> B |A
> B |E
> B |G
> B |H
> C |A
> C |B
>
>
> The join I want to do is:
>
> dsA.join(dsB, dsA("A_2") === dsB("B_1"), "INNER")
>
> However, this ends up putting a lot of pressure on the tasks that handle the
> frequently occurring keys - it's either very, very slow to complete or I
> encounter memory issues.
>
> I've played with grouping both sides by the join key prior to joining
> (which would make the join one-to-one) but memory seems to become an issue
> again as the groups are very large.
>
> Does anyone have any good suggestions as to how to make large many-to-many
> joins complete reliably in Spark?
>
> Reliability for me is much more important than speed - this is for a tool
> so I can't over-tune to specific data sizes/shapes.
>
> Thanks,
>
> Nathan
>
