Re: Remove duplicated rows

Devopam Mittra Tue, 17 Feb 2015 17:50:53 -0800

dear Philippe,
I asked specifically the version so as to understand what built-in features
you can leverage to your benefit.


There are two ways that I can think of - to help you in your case :
1. Use RANK as you have already noted in your original note - this will be
faster and more appropriate . Only suggestion is that please don't compress
the files that are linked in the external table.

2. You may want to leverage dynamic partition feature to your aid for
managing the duplicate records . I like it and use it extensively now a
days after 0.13 onward. Brief concept : dynamically partition the table on
one low skew column + 'key' column . Then simply INSERT OVERWRITE the
'delta' and it will seamlessly overwrite only the affected rows and do the
redistribution of data in the table internally without you having to bother
about the code or the burden to do it. You may refer to this blog I wrote
quite some time back : http://linkd.in/1Fq3wdb
This technique will cost you a little overhead time with the MR job getting
kicked off and all , but is painless and seamless , so I like it this way.

regards
Dev

On Wed, Feb 18, 2015 at 3:45 AM, Philippe Kernévez <pkerne...@octo.com>
wrote:

> Hi Dev,
>
> I'm using hive 0.14.
>
> Regards,
>
> On Tue, Feb 17, 2015 at 8:12 PM, Devopam Mittra <devo...@gmail.com> wrote:
>
>> Can you please state the hive version
>>
>> regards
>> Dev
>> +91 958 305 9899
>>
>> On Feb 17, 2015, at 11:01 PM, Philippe Kernévez <pkerne...@octo.com>
>> wrote:
>>
>> Hi,
>>
>> I have a table (named DEDUPLICATED) that contains about 1 billions rows.
>> Each day I receive a new file  with about 5 millions rows. About 10% of
>> those rows are duplicated (duplication occur inside a daily file but also
>> between files).
>> There are about 30 fields in the files.
>>
>> As for now I deduplicate all the data every day with the following
>> request :
>>
>>   INSERT OVERWRITE TABLE DEDUPLICATED
>>     SELECT cl.*
>>     FROM (
>>         SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>>         FROM DAILY d  -- DAILY is an external table that contains all the
>> daily files
>>         ) cl
>>     WHERE cl.pos = 1
>>
>> On the mailing list I saw another approach base on a "group by KEY"
>> request and use a 'select MAX(xxx)' for all non-key fields.
>>
>> My first question is : which of the both seems to be better ?
>> (the second one is quite harder to maintain as all the fields should be
>> explicitly written in the request).
>>
>>
>>
>> The second question is : what is the best way to do the deduplication and
>> import on a incremental approach ?
>> Something like that ?
>>   INSERT TABLE DEDUPLICATED
>>     SELECT cl.*
>>     FROM (
>>         SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>>         FROM LAST_DAILY_FILE d     -- ONLY the last file
>>         ) cl
>>     WHERE cl.pos = 1    -- REQUIRED to remove all the duplication inside
>> the last file
>>    AND cl.KEY NOT IN SELECT KEY FROM DEDUPLICATED  -- remove duplication
>> between the last file and all the existing files
>>
>> And the last question : for the last request, does an index on KEY help
>> with hive as it can help on a classical relational database ?
>>
>> Regards,
>> Philippe
>>
>>
>>
>> --
>> Philippe Kernévez
>>
>>
>>
>> Directeur technique (Suisse),
>> pkerne...@octo.com
>> +41 79 888 33 32
>>
>> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
>> OCTO Technology http://www.octo.com
>>
>>
>
>
> --
> Philippe Kernévez
>
>
>
> Directeur technique (Suisse),
> pkerne...@octo.com
> +41 79 888 33 32
>
> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
> OCTO Technology http://www.octo.com
>



-- 
Devopam Mittra
Life and Relations are not binary

Re: Remove duplicated rows

Reply via email to