One option is to create a separate column in table A with salting. Use it as the partition key, and use the original column for joining.
Ayan

On Sun, 31 Jul 2022 at 6:45 pm, Jacob Lynn <abebopare...@gmail.com> wrote:

> The key is this line from Amit's email (emphasis added):
>
> > Change the join_col to *all possible values* of the salt.
>
> The two tables are treated asymmetrically:
>
> 1. The skewed table gets random salts appended to the join key.
> 2. The other table gets all possible salts appended to the join key (e.g.
>    using a range array literal + explode).
>
> This guarantees that every row in the skewed table will still match its
> corresponding row in the other table. This StackOverflow answer
> <https://stackoverflow.com/a/57951114/1892435> gives an example.
>
> On Sun, 31 Jul 2022 at 10:41, Amit Joshi <mailtojoshia...@gmail.com> wrote:
>
>> Hi Sid,
>>
>> I am not sure I understood your question.
>> But the keys cannot differ between the two tables after salting; this is
>> what I showed in the explanation.
>> You salt Table A and then explode Table B to create all possible values.
>>
>> In your case, I do not understand why Table B has x_8/x_9. It should
>> contain all possible values that you used to create the salt.
>>
>> I hope you understand.
>>
>> Thanks
>>
>> On Sun, Jul 31, 2022 at 10:02 AM Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> Hi Amit,
>>>
>>> Thanks for your reply. However, your answer doesn't seem different from
>>> what I have explained.
>>>
>>> My question is this: if the keys are different after salting, as in my
>>> example, then an inner join would return no results. Even though the
>>> keys are spread across different partitions, they do not match, because
>>> x_1/x_2 != x_8/x_9.
>>>
>>> How do you ensure that the results are matched?
>>>
>>> Best,
>>> Sid
>>>
>>> On Sun, Jul 31, 2022 at 1:34 AM Amit Joshi <mailtojoshia...@gmail.com>
>>> wrote:
>>>
>>>> Hi Sid,
>>>>
>>>> Salting is normally a technique to add random characters to existing
>>>> values.
>>>> In big data we can use salting to deal with skewness.
>>>> Salting in a join can be used as follows:
>>>>
>>>> *Table A*
>>>> Col1, join_col, where join_col values are {x1, x2, x3}
>>>> x1
>>>> x1
>>>> x1
>>>> x2
>>>> x2
>>>> x3
>>>>
>>>> *Table B*
>>>> join_col, Col3, where join_col values are {x1, x2}
>>>> x1
>>>> x2
>>>>
>>>> *Problem:* Let's say that for Table A, the data is skewed on x1.
>>>> Now salting goes like this. *Salt value = 2*
>>>>
>>>> For *Table A*, create a new column by salting the join column:
>>>> *New_Join_Col*
>>>> x1_1
>>>> x1_2
>>>> x1_1
>>>> x2_1
>>>> x2_2
>>>> x3_1
>>>>
>>>> For *Table B*, change the join_col to all possible values of the salt:
>>>> join_col
>>>> x1_1
>>>> x1_2
>>>> x2_1
>>>> x2_2
>>>>
>>>> And then join it like
>>>> tableA.join(tableB, tableA.new_join_col == tableB.join_col)
>>>>
>>>> Let me know if you have any questions.
>>>>
>>>> Regards
>>>> Amit Joshi
>>>>
>>>> On Sat, Jul 30, 2022 at 7:16 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>>
>>>>> Hi Team,
>>>>>
>>>>> I was trying to understand the salting technique for a column where
>>>>> there would be a huge load on a single partition because of the same keys.
>>>>>
>>>>> I referred to a YouTube video and came away with the understanding below.
>>>>>
>>>>> Using the salting technique, we can change the joining column values
>>>>> by appending some random number in a specified range.
>>>>>
>>>>> So, suppose I have these values in a partition of two different
>>>>> tables:
>>>>>
>>>>> Table A:
>>>>> Partition1:
>>>>> x
>>>>> .
>>>>> .
>>>>> .
>>>>> x
>>>>>
>>>>> Table B:
>>>>> Partition1:
>>>>> x
>>>>> .
>>>>> .
>>>>> .
>>>>> x
>>>>>
>>>>> After salting it would be something like the below:
>>>>>
>>>>> Table A:
>>>>> Partition1:
>>>>> x_1
>>>>>
>>>>> Partition 2:
>>>>> x_2
>>>>>
>>>>> Table B:
>>>>> Partition1:
>>>>> x_3
>>>>>
>>>>> Partition 2:
>>>>> x_8
>>>>>
>>>>> Now, when I inner join these two tables after salting in order to
>>>>> avoid data skew, I won't get a match, since the keys are different
>>>>> after salting.
>>>>>
>>>>> So how does this resolve the data skewness issue, or is there a gap
>>>>> in my understanding?
>>>>>
>>>>> Could anyone explain this in layman's terms?
>>>>>
>>>>> TIA,
>>>>> Sid
>>>>>
>>>>
--
Best Regards,
Ayan Guha
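
For readers who want to check the matching logic discussed in this thread, here is a minimal sketch in plain Python rather than Spark, so it runs without a cluster. It assumes the small Table A / Table B example from Amit's email with two salt values; the function names (`salt_skewed`, `explode_other`, `inner_join`) and the payload values (`a1`, `b1`, ...) are hypothetical and only for illustration. It shows that salting the skewed side randomly and replicating the other side once per salt gives the same inner-join result as joining on the original key.

```python
import random

NUM_SALTS = 2  # assumed salt range, matching "Salt value = 2" in the thread


def salt_skewed(rows, num_salts, rng):
    """Skewed side: append one random salt in 1..num_salts to each join key."""
    return [(f"{key}_{rng.randint(1, num_salts)}", key, payload)
            for key, payload in rows]


def explode_other(rows, num_salts):
    """Other side: replicate each row once per possible salt value."""
    return [(f"{key}_{salt}", key, payload)
            for key, payload in rows
            for salt in range(1, num_salts + 1)]


def inner_join(left, right):
    """Hash join on the first field; yields (original_key, left_payload, right_payload)."""
    index = {}
    for join_key, _, payload in right:
        index.setdefault(join_key, []).append(payload)
    return [(key, left_payload, right_payload)
            for join_key, key, left_payload in left
            for right_payload in index.get(join_key, [])]


# Table A is skewed on x1; Table B has no x3, as in the example above.
table_a = [("x1", "a1"), ("x1", "a2"), ("x1", "a3"),
           ("x2", "a4"), ("x2", "a5"), ("x3", "a6")]
table_b = [("x1", "b1"), ("x2", "b2")]

rng = random.Random(0)  # seeded only to make the salts reproducible
salted_a = salt_skewed(table_a, NUM_SALTS, rng)
exploded_b = explode_other(table_b, NUM_SALTS)

salted_result = inner_join(salted_a, exploded_b)
# Reference result: the same join on the unsalted key.
plain_result = inner_join([(k, k, p) for k, p in table_a],
                          [(k, k, p) for k, p in table_b])
```

Whichever salt a row of Table A draws, the exploded Table B contains that exact salted key, so no match is lost. The cost is that Table B grows by a factor of `NUM_SALTS`, which is why the replicated side should be the smaller, non-skewed table.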