The key is this line from Amit's email (emphasis added):

> Change the join_col to *all possible values* of the salt.

The two tables are treated asymmetrically:

1. The skewed table gets random salts appended to the join key.
2. The other table gets all possible salts appended to the join key (e.g.
using a range array literal + explode).

This guarantees that every row in the skewed table whose key exists in the
other table still finds its match there, whichever salt it drew. This
StackOverflow answer <https://stackoverflow.com/a/57951114/1892435> gives an
example.
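
To make the asymmetry concrete, here is a minimal sketch in plain Python
rather than Spark. The table contents, the salt count of 2, and the helper
names (salt_skewed, explode_other, inner_join) are all made up for
illustration; in Spark you would do the same thing with DataFrame
operations. It only shows that the salted join returns the same pairs as the
unsalted one.

```python
import random

NUM_SALTS = 2  # assumed salt range, mirroring "Salt value = 2" in the thread


def salt_skewed(rows, num_salts, rng):
    """Skewed side: append ONE random salt in 1..num_salts to each join key."""
    return [(f"{key}_{rng.randint(1, num_salts)}", payload) for key, payload in rows]


def explode_other(rows, num_salts):
    """Other side: replicate each row once per possible salt (the 'explode')."""
    return [
        (f"{key}_{salt}", payload)
        for key, payload in rows
        for salt in range(1, num_salts + 1)
    ]


def inner_join(left, right):
    """Naive hash inner join on the first element of each tuple."""
    index = {}
    for key, payload in right:
        index.setdefault(key, []).append(payload)
    return [(key, l, r) for key, l in left for r in index.get(key, [])]


# Hypothetical data: Table A is skewed on x1; Table B has no row for x3.
table_a = [("x1", "a1"), ("x1", "a2"), ("x1", "a3"),
           ("x2", "a4"), ("x2", "a5"), ("x3", "a6")]
table_b = [("x1", "b1"), ("x2", "b2")]

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
salted_a = salt_skewed(table_a, NUM_SALTS, rng)
exploded_b = explode_other(table_b, NUM_SALTS)

plain = inner_join(table_a, table_b)
salted = inner_join(salted_a, exploded_b)

# The salted join yields the same (A payload, B payload) pairs as the plain one.
assert sorted((l, r) for _, l, r in plain) == sorted((l, r) for _, l, r in salted)
```

The price is that the non-skewed table grows by a factor of NUM_SALTS, so
this only pays off when that table is small relative to the skewed one.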

On Sun, 31 Jul 2022 at 10:41, Amit Joshi <mailtojoshia...@gmail.com> wrote:

> Hi Sid,
>
> I am not sure I understood your question.
> But the keys cannot differ between the two tables after salting; this is
> what I have shown in the explanation.
> You salt Table A and then explode Table B to create all possible values.
>
> In your case, I do not understand why Table B has x_8/x_9. It should have
> all possible values which you used to create the salt.
>
> I hope you understand.
>
> Thanks
>
>
>
> On Sun, Jul 31, 2022 at 10:02 AM Sid <flinkbyhe...@gmail.com> wrote:
>
>> Hi Amit,
>>
>> Thanks for your reply. However, your answer doesn't seem different from
>> what I have explained.
>>
>> My question is: if the keys are different after salting, as in my
>> example, then an inner join would return no results. Even though the keys
>> are spread across different partitions, they do not match, because
>> x_1/x_2 != x_8/x_9.
>>
>> How do you ensure that the results are matched?
>>
>> Best,
>> Sid
>>
>> On Sun, Jul 31, 2022 at 1:34 AM Amit Joshi <mailtojoshia...@gmail.com>
>> wrote:
>>
>>> Hi Sid,
>>>
>>> Salting is normally a technique to add random characters to existing
>>> values.
>>> In big data we can use salting to deal with the skewness.
>>> Salting in a join can be used as follows:
>>> *Table A-*
>>> Col1, join_col, where join_col values are {x1, x2, x3}
>>> x1
>>> x1
>>> x1
>>> x2
>>> x2
>>> x3
>>>
>>> *Table B-*
>>> join_col, Col3, where join_col values are {x1, x2}
>>> x1
>>> x2
>>>
>>> *Problem:* Let's say that for Table A, the data is skewed on x1.
>>> Now salting goes like this. *Salt value = 2*
>>> For *Table A*, create a new column by salting the join column:
>>> *New_Join_Col*
>>> x1_1
>>> x1_2
>>> x1_1
>>> x2_1
>>> x2_2
>>> x3_1
>>>
>>> For *Table B*,
>>> change the join_col to all possible values of the salt.
>>> join_col
>>> x1_1
>>> x1_2
>>> x2_1
>>> x2_2
>>>
>>> And then join them like this:
>>> tableA.join(tableB, tableA.new_join_col == tableB.join_col)
>>>
>>> Let me know if you have any questions.
>>>
>>> Regards
>>> Amit Joshi
>>>
>>>
>>> On Sat, Jul 30, 2022 at 7:16 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>
>>>> Hi Team,
>>>>
>>>> I was trying to understand the Salting technique for the column where
>>>> there would be a huge load on a single partition because of the same keys.
>>>>
>>>> I referred to one youtube video with the below understanding:
>>>>
>>>> So, using the salting technique we can actually change the joining
>>>> column values by appending some random number in a specified range.
>>>>
>>>> So, suppose I have these two values in a partition of two different
>>>> tables:
>>>>
>>>> Table A:
>>>> Partition1:
>>>> x
>>>> .
>>>> .
>>>> .
>>>> x
>>>>
>>>> Table B:
>>>> Partition1:
>>>> x
>>>> .
>>>> .
>>>> .
>>>> x
>>>>
>>>> After Salting it would be something like the below:
>>>>
>>>> Table A:
>>>> Partition1:
>>>> x_1
>>>>
>>>> Partition 2:
>>>> x_2
>>>>
>>>> Table B:
>>>> Partition1:
>>>> x_3
>>>>
>>>> Partition 2:
>>>> x_8
>>>>
>>>> Now, when I inner join these two tables after salting in order to avoid
>>>> data skewness problems, I won't get a match since the keys are different
>>>> after applying salting techniques.
>>>>
>>>> So how does this resolve the data skewness issue, or is there some
>>>> gap in my understanding?
>>>>
>>>> Could anyone help me in layman's terms?
>>>>
>>>> TIA,
>>>> Sid
>>>>
>>>
