Hi Sid,

Salting normally means adding random characters to existing values. In big data, we can use salting to deal with skew.

Salting in a join can be used as follows:

*Table A* - Col1, join_col, where the join_col values are {x1, x2, x3}:

x1
x1
x1
x2
x2
x3

*Table B* - join_col, Col3, where the join_col values are {x1, x2}:

x1
x2

*Problem:* Let's say the data in Table A is skewed on x1.

Now salting goes like this, with *salt value = 2*.

For *Table A*, create a new column by salting the join column:

*new_join_col*
x1_1
x1_2
x1_1
x2_1
x2_2
x3_1

For *Table B*, expand join_col to all possible values of the salt:

*join_col*
x1_1
x1_2
x2_1
x2_2

And then join them like:

tableA.join(tableB, tableA.new_join_col == tableB.join_col)

Because Table B carries every possible salted value, each row in Table A still finds its match, while the x1 rows are now spread across two keys instead of one.

Let me know if you have any questions.

Regards
Amit Joshi

On Sat, Jul 30, 2022 at 7:16 PM Sid <flinkbyhe...@gmail.com> wrote:

> Hi Team,
>
> I was trying to understand the salting technique for a column where
> there would be a huge load on a single partition because of the same keys.
>
> I referred to one YouTube video and came away with the below understanding:
>
> So, using the salting technique we can actually change the joining column
> values by appending some random number in a specified range.
>
> So, suppose I have these two values in a partition of two different tables:
>
> Table A:
> Partition 1:
> x
> .
> .
> .
> x
>
> Table B:
> Partition 1:
> x
> .
> .
> .
> x
>
> After salting it would be something like the below:
>
> Table A:
> Partition 1:
> x_1
>
> Partition 2:
> x_2
>
> Table B:
> Partition 1:
> x_3
>
> Partition 2:
> x_8
>
> Now, when I inner join these two tables after salting in order to avoid
> data skewness problems, I won't get a match since the keys are different
> after applying the salting technique.
>
> So how does this resolve the data skewness issue, or is there some gap
> in my understanding?
>
> Could anyone help me in layman's terms?
>
> TIA,
> Sid
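
P.S. The salted join described in the reply above (random salt on the skewed Table A, every salt value replicated on Table B) can be sketched in plain Python. This is only an illustration of the logic, not Spark code: the row values a1..a6 and b1, b2, and the helper names salt_skewed_side, explode_other_side and inner_join, are made up for the example.

```python
import random

SALT_COUNT = 2  # "salt value = 2" from the example

# Table A: (col1, join_col), skewed on x1. The col1 values are hypothetical.
table_a = [("a1", "x1"), ("a2", "x1"), ("a3", "x1"),
           ("a4", "x2"), ("a5", "x2"), ("a6", "x3")]
# Table B: (join_col, col3). The col3 values are hypothetical.
table_b = [("x1", "b1"), ("x2", "b2")]


def salt_skewed_side(rows, salt_count, rng):
    """Append a random salt in 1..salt_count to each join key of the skewed table."""
    return [(col1, f"{key}_{rng.randint(1, salt_count)}") for col1, key in rows]


def explode_other_side(rows, salt_count):
    """Replicate each row once per possible salt so every salted key has a partner."""
    return [(f"{key}_{salt}", col3)
            for key, col3 in rows
            for salt in range(1, salt_count + 1)]


def inner_join(left, right):
    """Simple hash join on the salted key; returns (col1, salted_key, col3) rows."""
    lookup = {}
    for key, col3 in right:
        lookup.setdefault(key, []).append(col3)
    return [(col1, key, col3)
            for col1, key in left
            for col3 in lookup.get(key, [])]


rng = random.Random(0)  # fixed seed only so that reruns give the same salts
salted_a = salt_skewed_side(table_a, SALT_COUNT, rng)
exploded_b = explode_other_side(table_b, SALT_COUNT)
joined = inner_join(salted_a, exploded_b)

# Strip the salt again to compare against an ordinary join on join_col.
unsalted = sorted((col1, key.rsplit("_", 1)[0], col3) for col1, key, col3 in joined)
print(exploded_b)
print(unsalted)
```

Whichever salt a Table A row draws, the matching salted key exists in the expanded Table B, so the result is the same set of rows an unsalted inner join would give. The cost is that Table B grows by a factor of the salt count, which is why this is usually applied when the non-skewed side is the smaller one.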