Hi Sid,

Salting is normally a technique to add random characters to existing values.
In big data we can use salting to deal with data skew: a heavily repeated key
is split into several salted keys, so its rows no longer all land in one partition.
Salting in a join can be used as follows:
*Table A-*
Col1, join_col, where join_col values are {x1, x2, x3}
x1
x1
x1
x2
x2
x3

*Table B-*
join_col, Col3, where join_col values are {x1, x2}
x1
x2

*Problem:* Let's say that for Table A the data is skewed on x1.
Now salting goes like this, with *salt value = 2*.
For *Table A*, create a new column by appending a salt (1 or 2) to each join_col value:
*New_Join_Col*
x1_1
x1_2
x1_1
x2_1
x2_2
x3_1
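
A minimal plain-Python sketch of this step (no Spark involved) may help. The helper name `salt_keys` and the per-key round-robin choice of salt are assumptions made so the output matches the values listed above; in practice the salt is often picked at random.

```python
from collections import Counter, defaultdict

SALT_BUCKETS = 2  # "Salt value = 2" from the example

# join_col values of Table A; x1 is the skewed key
table_a_keys = ["x1", "x1", "x1", "x2", "x2", "x3"]


def salt_keys(keys, buckets):
    """Append a salt suffix in 1..buckets to each key, cycling per key."""
    seen = defaultdict(int)  # rows already salted for each key
    salted = []
    for key in keys:
        suffix = seen[key] % buckets + 1
        seen[key] += 1
        salted.append(f"{key}_{suffix}")
    return salted


new_join_col = salt_keys(table_a_keys, SALT_BUCKETS)

# Rows per key before and after salting: the largest group shrinks.
before = Counter(table_a_keys)
after = Counter(new_join_col)
print(new_join_col)
print("largest group before:", max(before.values()))
print("largest group after:", max(after.values()))
```

The largest group of identical keys is what ends up on a single partition, so a smaller maximum means less skew.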

For *Table B*, replace each join_col value with all of its possible salted
values, so that every row is repeated once per salt:
join_col
x1_1
x1_2
x2_1
x2_2

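And then join them like this:
tableA.join(tableB, tableA.new_join_col == tableB.join_col)
Every salted key in Table A still finds its match in Table B, but the x1 rows
are now spread over two keys instead of one.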
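
To check that salting does not change the join result, here is a small plain-Python sketch of the whole procedure. It is not Spark code: the Col1 and Col3 values (`a1`, `b1`, ...) and the function names are made up for illustration, and the salt is taken from the row index rather than chosen at random.

```python
from collections import defaultdict

SALT_BUCKETS = 2

# (Col1, join_col) rows of Table A; Col1 values are hypothetical
table_a = [("a1", "x1"), ("a2", "x1"), ("a3", "x1"),
           ("a4", "x2"), ("a5", "x2"), ("a6", "x3")]
# (join_col, Col3) rows of Table B; Col3 values are hypothetical
table_b = [("x1", "b1"), ("x2", "b2")]


def plain_join(a_rows, b_rows):
    """Ordinary inner join on join_col, for comparison."""
    return sorted((col1, key, col3)
                  for col1, key in a_rows
                  for b_key, col3 in b_rows
                  if key == b_key)


def salted_join(a_rows, b_rows, buckets):
    """Inner join on the salted key."""
    # Table A: each row gets exactly one salt.
    salted_a = [(col1, key, f"{key}_{i % buckets + 1}")
                for i, (col1, key) in enumerate(a_rows)]
    # Table B: each row is replicated once per possible salt.
    salted_b = defaultdict(list)
    for key, col3 in b_rows:
        for salt in range(1, buckets + 1):
            salted_b[f"{key}_{salt}"].append(col3)
    joined = []
    for col1, key, salted_key in salted_a:
        for col3 in salted_b.get(salted_key, []):
            joined.append((col1, key, col3))
    return sorted(joined)


expected = plain_join(table_a, table_b)
result = salted_join(table_a, table_b, SALT_BUCKETS)
print(result == expected)
```

The two joins agree because Table B carries every possible salted form of each key, so whichever salt a Table A row receives, its partner row exists. The cost is that Table B grows by a factor equal to the number of salts.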

Let me know if you have any questions.

Regards
Amit Joshi


On Sat, Jul 30, 2022 at 7:16 PM Sid <flinkbyhe...@gmail.com> wrote:

> Hi Team,
>
> I was trying to understand the Salting technique for the column where
> there would be a huge load on a single partition because of the same keys.
>
> I referred to one youtube video with the below understanding:
>
> So, using the salting technique we can actually change the joining column
> values by appending some random number in a specified range.
>
> So, suppose I have these two values in a partition of two different tables:
>
> Table A:
> Partition1:
> x
> .
> .
> .
> x
>
> Table B:
> Partition1:
> x
> .
> .
> .
> x
>
> After Salting it would be something like the below:
>
> Table A:
> Partition1:
> x_1
>
> Partition 2:
> x_2
>
> Table B:
> Partition1:
> x_3
>
> Partition 2:
> x_8
>
> Now, when I inner join these two tables after salting in order to avoid
> data skewness problems, I won't get a match since the keys are different
> after applying salting techniques.
>
> So how does this resolve the data skewness issue, or is there some
> gap in my understanding?
>
> Could anyone help me in layman's terms?
>
> TIA,
> Sid
>
