Hi Sid,

I am not sure I understood your question, but the keys cannot differ between the two tables after salting; that is what I showed in my explanation. You salt Table A with a value from a fixed range, and then explode Table B so that it contains every possible salted value of each key.
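To make the point concrete, here is a minimal sketch in plain Python (no Spark) that mimics the two steps on the example data from the thread. The helper names, the `b1`/`b2` values standing in for Col3, and the use of in-memory lists are all assumptions made for illustration, not the actual Spark job.

```python
import random
from collections import Counter
from itertools import product

SALT_BUCKETS = 2  # "Salt value = 2" in the thread


def salt_skewed_side(keys, num_buckets, rng):
    """Table A: append a random salt in 1..num_buckets to every join key."""
    return [f"{key}_{rng.randint(1, num_buckets)}" for key in keys]


def explode_small_side(rows, num_buckets):
    """Table B: emit one copy of each row per possible salt value."""
    return [
        (f"{key}_{salt}", value)
        for (key, value), salt in product(rows, range(1, num_buckets + 1))
    ]


def inner_join(left_keys, right_rows):
    """Plain hash join on the key, standing in for the distributed join."""
    lookup = {}
    for key, value in right_rows:
        lookup.setdefault(key, []).append(value)
    return [(key, value) for key in left_keys for value in lookup.get(key, [])]


# Example data from the thread; only the join column of Table A is kept,
# and "b1"/"b2" are hypothetical Col3 values for Table B.
table_a = ["x1", "x1", "x1", "x2", "x2", "x3"]
table_b = [("x1", "b1"), ("x2", "b2")]

rng = random.Random(0)
salted_a = salt_skewed_side(table_a, SALT_BUCKETS, rng)
exploded_b = explode_small_side(table_b, SALT_BUCKETS)

joined = inner_join(salted_a, exploded_b)
unsalted = inner_join(table_a, table_b)

# Whatever salt a Table A row drew, Table B has a row with that same salt,
# so the salted join returns the same matches as the unsalted join.
print(len(joined), len(unsalted))
```

Because Table B holds every salt in the range, the random choice made on Table A can never produce an unmatched key; a value such as x_8 can only appear in Table B if 8 is also a salt that Table A could draw.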
In your case, I do not understand why Table B has x_8/x_9. It should contain every salt value you used when salting Table A, and nothing outside that range.

I hope this is clear.

Thanks

On Sun, Jul 31, 2022 at 10:02 AM Sid <flinkbyhe...@gmail.com> wrote:

> Hi Amit,
>
> Thanks for your reply. However, your answer doesn't seem different from
> what I have explained.
>
> My question is: if the keys are different after salting, as in my example,
> then the join would return no results, assuming an inner join, because
> even though the keys are spread across different partitions based on
> unique keys, they do not match, since x_1/x_2 != x_8/x_9.
>
> How do you ensure that the results are matched?
>
> Best,
> Sid
>
> On Sun, Jul 31, 2022 at 1:34 AM Amit Joshi <mailtojoshia...@gmail.com>
> wrote:
>
>> Hi Sid,
>>
>> Salting is normally a technique to add random characters to existing
>> values.
>> In big data we can use salting to deal with skewness.
>> Salting in a join can be used as follows:
>>
>> *Table A-*
>> Col1, join_col, where join_col values are {x1, x2, x3}
>> x1
>> x1
>> x1
>> x2
>> x2
>> x3
>>
>> *Table B-*
>> join_col, Col3, where join_col values are {x1, x2}
>> x1
>> x2
>>
>> *Problem:* Let's say that for table A, data is skewed on x1.
>> Now salting goes like this. *Salt value = 2*
>>
>> For *Table A,* create a new col with values by salting the join col
>> *New_Join_Col*
>> x1_1
>> x1_2
>> x1_1
>> x2_1
>> x2_2
>> x3_1
>>
>> For *Table B,*
>> change the join_col to all possible values of the salt.
>> join_col
>> x1_1
>> x1_2
>> x2_1
>> x2_2
>>
>> And then join it like
>> tableA.join(tableB, tableA.new_join_col == tableB.join_col)
>>
>> Let me know if you have any questions.
>>
>> Regards
>> Amit Joshi
>>
>>
>> On Sat, Jul 30, 2022 at 7:16 PM Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> Hi Team,
>>>
>>> I was trying to understand the salting technique for a column where
>>> there would be a huge load on a single partition because of the same keys.
>>>
>>> I referred to one YouTube video and came away with the understanding below:
>>>
>>> Using the salting technique, we can change the joining
>>> column values by appending some random number in a specified range.
>>>
>>> So, suppose I have these two values in a partition of two different
>>> tables:
>>>
>>> Table A:
>>> Partition1:
>>> x
>>> .
>>> .
>>> .
>>> x
>>>
>>> Table B:
>>> Partition1:
>>> x
>>> .
>>> .
>>> .
>>> x
>>>
>>> After salting it would be something like the below:
>>>
>>> Table A:
>>> Partition1:
>>> x_1
>>>
>>> Partition 2:
>>> x_2
>>>
>>> Table B:
>>> Partition1:
>>> x_3
>>>
>>> Partition 2:
>>> x_8
>>>
>>> Now, when I inner join these two tables after salting in order to avoid
>>> data skewness problems, I won't get a match, since the keys are different
>>> after applying the salting technique.
>>>
>>> So how does this resolve the data skewness issue, or is there some
>>> gap in my understanding?
>>>
>>> Could anyone help me in layman's terms?
>>>
>>> TIA,
>>> Sid