One option is to create a separate salted column in table A and use it as the
partition key, while still using the original column for joining.
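
A rough sketch of this option in plain Python (a simulation, not Spark API).
It assumes table B is small enough to be looked up from every partition, for
example via a broadcast; that assumption is mine and is not stated above. The
names `NUM_SALTS`, `partition_key` and the row counts are hypothetical.

```python
import random
from collections import Counter, defaultdict

NUM_SALTS = 4            # hypothetical number of salt buckets
rng = random.Random(0)   # seeded only so the sketch is repeatable

# Table A: (join_col, payload) rows, heavily skewed on "x1".
table_a = (
    [("x1", i) for i in range(900)]
    + [("x2", i) for i in range(60)]
    + [("x3", i) for i in range(40)]
)
# Table B: small lookup keyed on the ORIGINAL join column (assumed broadcastable).
table_b = {"x1": "b1", "x2": "b2"}

# Add a separate salted column; the original join_col is left untouched.
salted_a = [
    (f"{key}_{rng.randrange(NUM_SALTS)}", key, payload)
    for key, payload in table_a
]

# Partition on the salted column: rows sharing a salted value stay together.
partitions = defaultdict(list)
for partition_key, key, payload in salted_a:
    partitions[partition_key].append((key, payload))

# Join each partition independently on the original column (inner join).
joined = [
    (key, payload, table_b[key])
    for rows in partitions.values()
    for key, payload in rows
    if key in table_b
]

# Compare the largest group of rows before and after salting.
largest_unsalted = max(Counter(key for key, _ in table_a).values())
largest_salted = max(len(rows) for rows in partitions.values())
print(largest_unsalted, largest_salted, len(joined))
```

The join result is unchanged because the match is still made on the original
column; only the grouping of table A's rows changes, so the "x1" rows are no
longer all in one group.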

Ayan

On Sun, 31 Jul 2022 at 6:45 pm, Jacob Lynn <abebopare...@gmail.com> wrote:

> The key is this line from Amit's email (emphasis added):
>
> > Change the join_col to *all possible values* of the salt.
>
> The two tables are treated asymmetrically:
>
> 1. The skewed table gets random salts appended to the join key.
> 2. The other table gets all possible salts appended to the join key (e.g.
> using a range array literal + explode).
>
> This guarantees that every row in the skewed table that matched before
> salting still matches a row in the other table. This StackOverflow answer
> <https://stackoverflow.com/a/57951114/1892435> gives an example.
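
The two steps above can be sketched in plain Python (a simulation of the idea,
not Spark code), using the tables and the salt value of 2 from Amit's example
further down the thread. The helper `inner_join` and the name `NUM_SALTS` are
made up for this illustration.

```python
import random
from collections import Counter

NUM_SALTS = 2  # the "salt value" from Amit's example

# Table A (skewed on x1) and Table B, reduced to their join keys.
table_a = ["x1", "x1", "x1", "x2", "x2", "x3"]
table_b = ["x1", "x2"]

rng = random.Random(0)  # seeded only so the sketch is repeatable

# 1. Skewed table: append ONE random salt to each row's key.
salted_a = [f"{key}_{rng.randint(1, NUM_SALTS)}" for key in table_a]

# 2. Other table: replicate each row once per possible salt ("explode").
exploded_b = [
    f"{key}_{salt}"
    for key in table_b
    for salt in range(1, NUM_SALTS + 1)
]


def inner_join(left_keys, right_keys):
    """Return one entry per matching (left row, right row) pair."""
    right_counts = Counter(right_keys)
    return [k for k in left_keys for _ in range(right_counts[k])]


plain_result = inner_join(table_a, table_b)
salted_result = inner_join(salted_a, exploded_b)

print(len(plain_result), len(salted_result))
```

Both joins return five rows, whichever salts are drawn: each salted key in
table A finds exactly one of the exploded copies in table B, and x3 has no
match in either case. The cost is that table B grows by a factor of
`NUM_SALTS`.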
>
> On Sun, 31 Jul 2022 at 10:41, Amit Joshi <mailtojoshia...@gmail.com> wrote:
>
>> Hi Sid,
>>
>> I am not sure I understood your question.
>> But the salted keys cannot differ between the two tables; this is what I
>> showed in the explanation.
>> You salt Table A and then explode Table B to create all possible values.
>>
>> In your case, I do not understand why Table B has x_8/x_9. It should
>> contain all the possible values that you used to create the salt.
>>
>> I hope you understand.
>>
>> Thanks
>>
>>
>>
>> On Sun, Jul 31, 2022 at 10:02 AM Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> Hi Amit,
>>>
>>> Thanks for your reply. However, your answer doesn't seem different from
>>> what I have explained.
>>>
>>> My question is this: if the keys are different after salting, as in my
>>> example, then an inner join would return no results. Even though the keys
>>> are spread across different partitions based on their unique values, they
>>> do not match, because x_1/x_2 != x_8/x_9.
>>>
>>> How do you ensure that the results are matched?
>>>
>>> Best,
>>> Sid
>>>
>>> On Sun, Jul 31, 2022 at 1:34 AM Amit Joshi <mailtojoshia...@gmail.com>
>>> wrote:
>>>
>>>> Hi Sid,
>>>>
>>>> Salting is normally a technique of adding random characters to existing
>>>> values.
>>>> In big data we can use salting to deal with skew.
>>>> Salting in a join can be used as follows:
>>>> *Table A-*
>>>> Col1, join_col, where join_col values are {x1, x2, x3}
>>>> x1
>>>> x1
>>>> x1
>>>> x2
>>>> x2
>>>> x3
>>>>
>>>> *Table B-*
>>>> join_col, Col3, where join_col values are {x1, x2}
>>>> x1
>>>> x2
>>>>
>>>> *Problem:* Let's say the data in table A is skewed on x1.
>>>> Now salting goes like this. *Salt value = 2*
>>>> For *Table A*, create a new column by salting the join column:
>>>> *New_Join_Col*
>>>> x1_1
>>>> x1_2
>>>> x1_1
>>>> x2_1
>>>> x2_2
>>>> x3_1
>>>>
>>>> For *Table B*,
>>>> replicate each row so that join_col takes all possible values of the salt:
>>>> join_col
>>>> x1_1
>>>> x1_2
>>>> x2_1
>>>> x2_2
>>>>
>>>> And then join them like this:
>>>> tableA.join(tableB, tableA.new_join_col == tableB.join_col)
>>>>
>>>> Let me know if you have any questions.
>>>>
>>>> Regards
>>>> Amit Joshi
>>>>
>>>>
>>>> On Sat, Jul 30, 2022 at 7:16 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>>
>>>>> Hi Team,
>>>>>
>>>>> I was trying to understand the salting technique for a column where
>>>>> identical keys put a huge load on a single partition.
>>>>>
>>>>> I watched a YouTube video and came away with the following understanding:
>>>>>
>>>>> Using the salting technique, we can change the joining column values by
>>>>> appending a random number in a specified range.
>>>>>
>>>>> So, suppose I have these two values in a partition of two different
>>>>> tables:
>>>>>
>>>>> Table A:
>>>>> Partition1:
>>>>> x
>>>>> .
>>>>> .
>>>>> .
>>>>> x
>>>>>
>>>>> Table B:
>>>>> Partition1:
>>>>> x
>>>>> .
>>>>> .
>>>>> .
>>>>> x
>>>>>
>>>>> After Salting it would be something like the below:
>>>>>
>>>>> Table A:
>>>>> Partition1:
>>>>> x_1
>>>>>
>>>>> Partition 2:
>>>>> x_2
>>>>>
>>>>> Table B:
>>>>> Partition1:
>>>>> x_3
>>>>>
>>>>> Partition 2:
>>>>> x_8
>>>>>
>>>>> Now, when I inner join these two tables after salting in order to
>>>>> avoid data skew problems, I won't get a match, since the keys are
>>>>> different after salting.
>>>>>
>>>>> So how does this resolve the data skew issue, or is there a gap in my
>>>>> understanding?
>>>>>
>>>>> Could anyone help me in layman's terms?
>>>>>
>>>>> TIA,
>>>>> Sid
>>>>>
>>>> --
Best Regards,
Ayan Guha
