The key is this line from Amit's email (emphasis added):

> Change the join_col to *all possible values* of the salt.

The two tables are treated asymmetrically:

1. The skewed table gets random salts appended to the join key.
2. The other table gets all possible salts appended to the join key
   (e.g. using a range array literal + explode).

This guarantees that every row in the skewed table whose original key
exists in the other table still finds its match there, whichever salt it
was given. This StackOverflow answer
<https://stackoverflow.com/a/57951114/1892435> gives an example.

On Sun, 31 Jul 2022 at 10:41, Amit Joshi <mailtojoshia...@gmail.com> wrote:

> Hi Sid,
>
> I am not sure I understood your question.
> But the keys cannot be different post salting in both the tables; this is
> what I have shown in the explanation.
> You salt Table A and then explode Table B to create all possible values.
>
> In your case, I do not understand why Table B has x_8/x_9. It should have
> all possible values which you used to create the salt.
>
> I hope you understand.
>
> Thanks
>
> On Sun, Jul 31, 2022 at 10:02 AM Sid <flinkbyhe...@gmail.com> wrote:
>
>> Hi Amit,
>>
>> Thanks for your reply. However, your answer doesn't seem different from
>> what I have explained.
>>
>> My question is: after salting, if the keys are different as in my example,
>> then the join would return no results (assuming an inner join), because
>> even though the keys are segregated into different partitions based on
>> unique keys, they do not match: x_1/x_2 != x_8/x_9.
>>
>> How do you ensure that the results are matched?
>>
>> Best,
>> Sid
>>
>> On Sun, Jul 31, 2022 at 1:34 AM Amit Joshi <mailtojoshia...@gmail.com>
>> wrote:
>>
>>> Hi Sid,
>>>
>>> Salting is normally a technique to add random characters to existing
>>> values.
>>> In big data we can use salting to deal with the skewness.
>>>
>>> Salting in a join can be used as follows:
>>>
>>> *Table A-*
>>> Col1, join_col, where join_col values are {x1, x2, x3}
>>> x1
>>> x1
>>> x1
>>> x2
>>> x2
>>> x3
>>>
>>> *Table B-*
>>> join_col, Col3, where join_col values are {x1, x2}
>>> x1
>>> x2
>>>
>>> *Problem:* Let's say that for Table A the data is skewed on x1.
>>> Now salting goes like this. *Salt value = 2*
>>>
>>> For *Table A*, create a new column by salting the join column:
>>> *New_Join_Col*
>>> x1_1
>>> x1_2
>>> x1_1
>>> x2_1
>>> x2_2
>>> x3_1
>>>
>>> For *Table B*, change the join_col to all possible values of the salt:
>>> join_col
>>> x1_1
>>> x1_2
>>> x2_1
>>> x2_2
>>>
>>> And then join it like
>>> tableA.join(tableB, tableA.new_join_col == tableB.join_col)
>>>
>>> Let me know if you have any questions.
>>>
>>> Regards
>>> Amit Joshi
>>>
>>> On Sat, Jul 30, 2022 at 7:16 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>
>>>> Hi Team,
>>>>
>>>> I was trying to understand the salting technique for a column where
>>>> there would be a huge load on a single partition because of the same keys.
>>>>
>>>> I referred to a YouTube video and came away with the below understanding:
>>>>
>>>> So, using the salting technique we can actually change the joining
>>>> column values by appending some random number in a specified range.
>>>>
>>>> So, suppose I have these two values in a partition of two different
>>>> tables:
>>>>
>>>> Table A:
>>>> Partition 1:
>>>> x
>>>> .
>>>> .
>>>> .
>>>> x
>>>>
>>>> Table B:
>>>> Partition 1:
>>>> x
>>>> .
>>>> .
>>>> .
>>>> x
>>>>
>>>> After salting it would be something like the below:
>>>>
>>>> Table A:
>>>> Partition 1:
>>>> x_1
>>>>
>>>> Partition 2:
>>>> x_2
>>>>
>>>> Table B:
>>>> Partition 1:
>>>> x_3
>>>>
>>>> Partition 2:
>>>> x_8
>>>>
>>>> Now, when I inner join these two tables after salting in order to avoid
>>>> data skewness problems, I won't get a match since the keys are different
>>>> after applying salting techniques.
>>>>
>>>> So how does this resolve the data skewness issue, or is there some
>>>> gap in my understanding?
>>>>
>>>> Could anyone help me in layman's terms?
>>>>
>>>> TIA,
>>>> Sid
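
For anyone who wants to convince themselves that the asymmetric salting discussed in this thread keeps every match, here is a minimal sketch in plain Python (not Spark code). It reuses the Table A / Table B keys and the salt value of 2 from Amit's example. The Col1/Col3 values (a1, b1, ...) and all function names are made up for illustration, and the list-based join only simulates what a Spark inner join would return.

```python
import random

NUM_SALTS = 2  # the "salt value" from Amit's example

# Table A (skewed on x1) as (col1, join_col); col1 values are hypothetical
table_a = [("a1", "x1"), ("a2", "x1"), ("a3", "x1"),
           ("a4", "x2"), ("a5", "x2"), ("a6", "x3")]
# Table B as (join_col, col3); col3 values are hypothetical
table_b = [("x1", "b1"), ("x2", "b2")]


def salt_skewed(rows, num_salts, rng):
    """Skewed side: append ONE random salt in 1..num_salts to each key."""
    return [(col1, f"{key}_{rng.randint(1, num_salts)}") for col1, key in rows]


def explode_other(rows, num_salts):
    """Other side: replicate each row once for EVERY possible salt."""
    return [(f"{key}_{salt}", col3)
            for key, col3 in rows
            for salt in range(1, num_salts + 1)]


def inner_join(left, right):
    """Inner join of (col1, key) rows with (key, col3) rows on key."""
    index = {}
    for key, col3 in right:
        index.setdefault(key, []).append(col3)
    return [(col1, key, col3)
            for col1, key in left
            for col3 in index.get(key, [])]


rng = random.Random(0)  # fixed seed only so the run is repeatable
salted_a = salt_skewed(table_a, NUM_SALTS, rng)
exploded_b = explode_other(table_b, NUM_SALTS)

plain_result = inner_join(table_a, table_b)
salted_result = inner_join(salted_a, exploded_b)

# Strip the salt suffix again so both results can be compared directly
unsalted_result = [(col1, key.rsplit("_", 1)[0], col3)
                   for col1, key, col3 in salted_result]

print(sorted(unsalted_result) == sorted(plain_result))
```

Whatever salt a row of Table A happens to draw, Table B holds a copy of the matching row under that same salted key, so the salted join returns the same rows as the unsalted one. The price is that Table B grows by a factor of NUM_SALTS, which is why the exploded side should be the smaller, non-skewed table.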