Hi Amit,

Thanks for your reply. However, your answer doesn't seem different from
what I have explained.

My question is: if the keys differ after salting, as in my example, then an
inner join would return no results. Even though the keys are spread across
different partitions, they no longer match, because x_1/x_2 != x_8/x_9.

How do you ensure that the results are matched?

Best,
Sid

On Sun, Jul 31, 2022 at 1:34 AM Amit Joshi <mailtojoshia...@gmail.com>
wrote:

> Hi Sid,
>
> Salting is normally a technique to add random characters to existing
> values.
> In big data we can use salting to deal with the skewness.
> Salting in a join can be used as follows:
> *Table A-*
> Col1, join_col, where join_col values are {x1, x2, x3}
> x1
> x1
> x1
> x2
> x2
> x3
>
> *Table B-*
> join_col, Col3, where join_col values are {x1, x2}
> x1
> x2
>
> *Problem:* Let's say that for table A, the data is skewed on x1.
> Now salting goes like this, with *salt value = 2*.
> For *Table A*, create a new column by appending a salt to join_col:
> *New_Join_Col*
> x1_1
> x1_2
> x1_1
> x2_1
> x2_2
> x3_1
>
> For *Table B*,
> replicate each row so that join_col takes every possible value of the salt:
> join_col
> x1_1
> x1_2
> x2_1
> x2_2
>
> And then join them like this:
> tableA.join(tableB, tableA.new_join_col == tableB.join_col)
>
> Let me know if you have any questions.
>
> Regards
> Amit Joshi
>
>
> On Sat, Jul 30, 2022 at 7:16 PM Sid <flinkbyhe...@gmail.com> wrote:
>
>> Hi Team,
>>
>> I was trying to understand the Salting technique for the column where
>> there would be a huge load on a single partition because of the same keys.
>>
>> I watched a YouTube video and came away with the following understanding:
>>
>> So, using the salting technique we can actually change the joining column
>> values by appending some random number in a specified range.
>>
>> So, suppose I have these two values in a partition of two different
>> tables:
>>
>> Table A:
>> Partition1:
>> x
>> .
>> .
>> .
>> x
>>
>> Table B:
>> Partition1:
>> x
>> .
>> .
>> .
>> x
>>
>> After Salting it would be something like the below:
>>
>> Table A:
>> Partition1:
>> x_1
>>
>> Partition 2:
>> x_2
>>
>> Table B:
>> Partition1:
>> x_3
>>
>> Partition 2:
>> x_8
>>
>> Now, when I inner join these two tables after salting in order to avoid
>> data skewness problems, I won't get a match since the keys are different
>> after applying salting techniques.
>>
>> So how does this resolve the data skewness issue? Or is there some
>> gap in my understanding?
>>
>> Could anyone help me in layman's terms?
>>
>> TIA,
>> Sid
>>
>
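
For reference, the two schemes discussed in this thread can be compared in a
small plain-Python sketch. This is not Spark code and not anyone's actual
implementation; the helper names (salt_a, explode_b, inner_join) and the col3
values are made up for illustration. It assumes the scheme from Amit's mail:
table A gets a random salt in 1..2, and table B is replicated once per
possible salt value. It also includes the case from the original question,
where B carries salts that A never uses.

```python
import random
from collections import Counter

SALT_BUCKETS = 2  # "Salt value = 2" from the thread

# Table A: join_col only, skewed on x1 (other columns omitted).
table_a = ["x1", "x1", "x1", "x2", "x2", "x3"]
# Table B: (join_col, col3); the col3 values are hypothetical.
table_b = [("x1", "b1"), ("x2", "b2")]


def salt_a(rows, buckets, rng):
    """Pair each key of the skewed table with key + random salt in 1..buckets."""
    return [(key, f"{key}_{rng.randint(1, buckets)}") for key in rows]


def explode_b(rows, buckets):
    """Replicate every row of the other table once per possible salt value."""
    return [(f"{key}_{s}", col3)
            for key, col3 in rows
            for s in range(1, buckets + 1)]


def inner_join(a_rows, b_rows):
    """Inner join on the salted key; returns (original key, col3) pairs."""
    lookup = {}
    for salted, col3 in b_rows:
        lookup.setdefault(salted, []).append(col3)
    return [(key, col3)
            for key, salted in a_rows
            for col3 in lookup.get(salted, [])]


rng = random.Random(0)
salted_a = salt_a(table_a, SALT_BUCKETS, rng)
exploded_b = explode_b(table_b, SALT_BUCKETS)

# Scheme from Amit's mail: every salted A key has a counterpart in B.
salted_result = inner_join(salted_a, exploded_b)

# Reference: the same inner join without any salting.
plain_result = [(key, col3)
                for key in table_a
                for b_key, col3 in table_b
                if b_key == key]

# Scheme from the question: B carries salts that A never uses (8 and 9).
b_mismatched = [("x1_8", "b1"), ("x2_9", "b2")]
mismatched_result = inner_join(salted_a, b_mismatched)

print(Counter(salted_result) == Counter(plain_result))
print(mismatched_result)
```

In this sketch the salted join returns the same rows as the unsalted join
because each A row carries exactly one salt and B holds a copy for every salt,
so each A row meets exactly one B copy. When both sides are salted
independently, as with x_8/x_9, nothing matches.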
