Hi Dev, and thank you for your help.

On Wed, Feb 18, 2015 at 11:31 AM, Devopam Mittra <devo...@gmail.com> wrote:
> hi Philippe,
> Performance improvement has two factors: 1. availability (read: abundance)
> of resources, 2. need for speed.
> All advice usually addresses mainly these two factors, as far as I have
> seen so far.
>
> Since you are doing a full scan each day to identify duplicates, I suggested
> keeping the data uncompressed (in fact you may do the same with a separate
> table holding only the key columns).

Is there a difference - in terms of performance - between another table with
only the key columns and an index?

> Apologies that the post didn't come out clear in terms of scalability; you
> are definitely the best judge of the fit-gap analysis, being closest to the
> issues. Dynamic partitions allow a selective target on the table rather than
> a full table scan, so I presumed that scaling up should never be an issue ...
>
> Couldn't be of much help here I guess :(

Thanks a lot for your contribution.
Philippe

> regards
> Dev
>
> On Wed, Feb 18, 2015 at 3:02 PM, Philippe Kernévez <pkerne...@octo.com> wrote:
>
>> Hi Dev,
>>
>> > "Only suggestion is that please don't compress the files that are
>> > linked in the external table."
>> Why do you suggest that? At present, I halve my import time by using
>> 'gz' files instead of the uncompressed 'csv'.
>>
>> > "You may want to leverage dynamic partition"
>> I read your post, and I wonder if it can scale. It seems that you have
>> hourly and daily aggregations: more or less 24*365 = 8'760 rows per year.
>> We have about 2 billion (2E9) rows per year.
>> Do you know if your approach can scale to such a volume?
>> I also wonder about the impact of the partition columns on the map-reduce
>> job: they will probably produce a large number of files to build those
>> partitions.
>>
>> Regards,
>> Philippe
>>
>> On Wed, Feb 18, 2015 at 2:49 AM, Devopam Mittra <devo...@gmail.com> wrote:
>>
>>> dear Philippe,
>>> I asked specifically for the version so as to understand what built-in
>>> features you can leverage to your benefit.
>>>
>>> There are two ways that I can think of to help in your case:
>>> 1. Use RANK as you have already noted in your original note - this will
>>> be faster and more appropriate. Only suggestion is that please don't
>>> compress the files that are linked in the external table.
>>>
>>> 2. You may want to leverage the dynamic partition feature to manage the
>>> duplicate records. I like it and use it extensively nowadays, from 0.13
>>> onward. Brief concept: dynamically partition the table on one low-skew
>>> column + the 'key' column. Then simply INSERT OVERWRITE the 'delta' and
>>> it will seamlessly overwrite only the affected partitions and redistribute
>>> the data in the table internally, without you having to bother about the
>>> code or the burden of doing it yourself. You may refer to this blog I
>>> wrote quite some time back: http://linkd.in/1Fq3wdb
>>> This technique will cost you a little overhead time with the MR job
>>> getting kicked off and all, but it is painless and seamless, so I like
>>> it this way.
>>>
>>> regards
>>> Dev
>>>
>>> On Wed, Feb 18, 2015 at 3:45 AM, Philippe Kernévez <pkerne...@octo.com> wrote:
>>>
>>>> Hi Dev,
>>>>
>>>> I'm using Hive 0.14.
>>>>
>>>> Regards,
>>>>
>>>> On Tue, Feb 17, 2015 at 8:12 PM, Devopam Mittra <devo...@gmail.com> wrote:
>>>>
>>>>> Can you please state the hive version
>>>>>
>>>>> regards
>>>>> Dev
>>>>> +91 958 305 9899
>>>>>
>>>>> On Feb 17, 2015, at 11:01 PM, Philippe Kernévez <pkerne...@octo.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have a table (named DEDUPLICATED) that contains about 1 billion rows.
>>>>> Each day I receive a new file with about 5 million rows.
>>>>> About 10% of those rows are duplicated (duplication occurs inside a
>>>>> daily file but also between files).
>>>>> There are about 30 fields in the files.
>>>>>
>>>>> For now I deduplicate all the data every day with the following query:
>>>>>
>>>>> INSERT OVERWRITE TABLE DEDUPLICATED
>>>>> SELECT cl.*
>>>>> FROM (
>>>>>     SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>>>>>     FROM DAILY d  -- DAILY is an external table that contains all the daily files
>>>>> ) cl
>>>>> WHERE cl.pos = 1
>>>>>
>>>>> On the mailing list I saw another approach based on a "GROUP BY KEY"
>>>>> query that uses SELECT MAX(xxx) for all the non-key fields.
>>>>>
>>>>> My first question is: which of the two seems better?
>>>>> (The second one is rather harder to maintain, as all the fields have
>>>>> to be written explicitly in the query.)
>>>>>
>>>>> The second question is: what is the best way to do the deduplication
>>>>> and import with an incremental approach?
>>>>> Something like this?
>>>>>
>>>>> INSERT INTO TABLE DEDUPLICATED
>>>>> SELECT cl.*
>>>>> FROM (
>>>>>     SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>>>>>     FROM LAST_DAILY_FILE d  -- ONLY the last file
>>>>> ) cl
>>>>> WHERE cl.pos = 1  -- required to remove duplication inside the last file
>>>>>   AND cl.KEY NOT IN (SELECT KEY FROM DEDUPLICATED)  -- remove duplication
>>>>>       -- between the last file and all the existing files
>>>>>
>>>>> And the last question: for the last query, does an index on KEY help
>>>>> in Hive, as it would in a classical relational database?
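[Editor's note: a side remark on the incremental query above. Hive's support for subqueries in the WHERE clause is limited, and NOT IN also misbehaves when the subquery can return NULL keys, so the same anti-join is commonly written with a LEFT JOIN. The following is only a sketch reusing the table and column names from the thread; the explicit field list is a placeholder, since `cl.*` would drag the extra `pos` column into the target table.]

```sql
-- Sketch: incremental load rewritten as a left anti-join,
-- avoiding the NOT IN subquery.
INSERT INTO TABLE DEDUPLICATED
SELECT cl.field1, cl.field2  -- ... list the ~30 fields explicitly, omitting pos
FROM (
    SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
    FROM LAST_DAILY_FILE d            -- only the newest file
) cl
LEFT JOIN DEDUPLICATED dd ON cl.KEY = dd.KEY
WHERE cl.pos = 1                      -- drop duplicates inside the file
  AND dd.KEY IS NULL;                 -- keep only keys not already present
```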
>>>>>
>>>>> Regards,
>>>>> Philippe
>>>
>>> --
>>> Devopam Mittra
>>> Life and Relations are not binary

-- 
Philippe Kernévez

Directeur technique (Suisse),
pkerne...@octo.com
+41 79 888 33 32

Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
OCTO Technology http://www.octo.com
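[Editor's note: for reference, Dev's dynamic-partition suggestion in the thread might look roughly like the sketch below. All table and column names here are invented for illustration (`DEDUPLICATED_P`, `low_skew_col`), and partitioning on a high-cardinality key column is exactly the scalability concern Philippe raises, so treat this as an outline of the idea, not a recommended implementation.]

```sql
-- Sketch of the dynamic-partition dedup idea (names are assumptions).
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Target table partitioned on one low-skew column plus the key column.
CREATE TABLE DEDUPLICATED_P (
  field1 STRING,
  field2 STRING
  -- ... remaining non-key fields
)
PARTITIONED BY (low_skew_col STRING, key STRING);

-- INSERT OVERWRITE with the daily delta replaces only the partitions
-- that appear in the SELECT; untouched partitions are left as-is.
-- Partition columns must come last in the SELECT list.
INSERT OVERWRITE TABLE DEDUPLICATED_P PARTITION (low_skew_col, key)
SELECT field1, field2, low_skew_col, key
FROM LAST_DAILY_FILE;
```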