Hi Philippe,

Performance improvement has two factors:
1. availability (read: abundance) of resources
2. need for speed

All advice usually addresses mainly these two factors, as far as I have seen so far.
Since you are doing a full scan each day to identify duplicates, I suggested keeping the data uncompressed (in fact you may do the same with a separate table holding only the key columns).
Apologies that the post didn't come out clearly in terms of scalability; you are definitely the best judge of the fit-gap analysis, being closest to the issues. Dynamic partitions allow a selective target on the table rather than a full table scan, so I presumed that scaling up should never be an issue ...
Couldn't be of much help here I guess :(

regards
Dev
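For illustration, a minimal sketch of the dynamic-partition overwrite described in the earlier message quoted below. Only DAILY and the KEY column come from the thread; DEDUPLICATED_PART, load_month, data_key and col1/col2 are hypothetical stand-ins, and the one-partition-per-key layout is exactly what drives the partition and file-count question raised further down.

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- one partition per key means the default dynamic-partition limits must be raised
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;

-- Hypothetical partitioned target: a low-skew column plus the dedup key.
-- Only two of the ~30 fields are shown.
CREATE TABLE IF NOT EXISTS DEDUPLICATED_PART (
  col1 STRING,
  col2 STRING
)
PARTITIONED BY (load_month STRING, data_key STRING);

-- Only the partitions present in the daily delta are overwritten;
-- the dynamic partition columns must come last in the SELECT list.
INSERT OVERWRITE TABLE DEDUPLICATED_PART PARTITION (load_month, data_key)
SELECT cl.col1, cl.col2, cl.load_month, cl.KEY
FROM (
    SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
    FROM DAILY d
) cl
WHERE cl.pos = 1;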
On Wed, Feb 18, 2015 at 3:02 PM, Philippe Kernévez <pkerne...@octo.com> wrote:

> Hi Dev,
>
> "Only suggestion is that please don't compress the files that are
> linked in the external table".
> Why do you suggest that? Presently, I divide my import time by 2 by using
> 'gz' files instead of the uncompressed 'csv'.
>
> "You may want to leverage dynamic partition"
> I read your post, and I wonder if it can scale. It seems that you have
> hourly and daily aggregation, more or less 24*365 = 8'760 rows per year.
> We have about 2 billion (2E9) rows per year.
> Do you know if your approach can scale to such a factor?
> I also wonder about the impact of the partition columns on the map reduce
> job; they will probably produce a large number of files to build those
> partitions.
>
> Regards,
> Philippe
>
> On Wed, Feb 18, 2015 at 2:49 AM, Devopam Mittra <devo...@gmail.com> wrote:
>
>> dear Philippe,
>> I asked specifically about the version so as to understand which built-in
>> features you can leverage to your benefit.
>>
>> There are two ways that I can think of to help you in your case:
>> 1. Use RANK as you have already noted in your original note - this will
>> be faster and more appropriate. Only suggestion is that please don't
>> compress the files that are linked in the external table.
>>
>> 2. You may want to leverage the dynamic partition feature to manage the
>> duplicate records. I like it and use it extensively nowadays, from 0.13
>> onward. Brief concept: dynamically partition the table on one low-skew
>> column + the 'key' column, then simply INSERT OVERWRITE the 'delta' and it
>> will seamlessly overwrite only the affected rows and redistribute the data
>> in the table internally, without you having to bother about the code or
>> carry that burden yourself. You may refer to this blog I wrote quite some
>> time back: http://linkd.in/1Fq3wdb
>> This technique will cost you a little overhead time with the MR job
>> getting kicked off and all, but it is painless and seamless, so I like it
>> this way.
>>
>> regards
>> Dev
>>
>> On Wed, Feb 18, 2015 at 3:45 AM, Philippe Kernévez <pkerne...@octo.com>
>> wrote:
>>
>>> Hi Dev,
>>>
>>> I'm using Hive 0.14.
>>>
>>> Regards,
>>>
>>> On Tue, Feb 17, 2015 at 8:12 PM, Devopam Mittra <devo...@gmail.com>
>>> wrote:
>>>
>>>> Can you please state the Hive version?
>>>>
>>>> regards
>>>> Dev
>>>> +91 958 305 9899
>>>>
>>>> On Feb 17, 2015, at 11:01 PM, Philippe Kernévez <pkerne...@octo.com>
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I have a table (named DEDUPLICATED) that contains about 1 billion rows.
>>>> Each day I receive a new file with about 5 million rows. About 10% of
>>>> those rows are duplicated (duplication occurs inside a daily file but
>>>> also between files).
>>>> There are about 30 fields in the files.
>>>>
>>>> For now I deduplicate all the data every day with the following query:
>>>>
>>>> INSERT OVERWRITE TABLE DEDUPLICATED
>>>> SELECT cl.*
>>>> FROM (
>>>>     SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>>>>     FROM DAILY d   -- DAILY is an external table that contains all the daily files
>>>> ) cl
>>>> WHERE cl.pos = 1
>>>>
>>>> On the mailing list I saw another approach, based on a "GROUP BY KEY"
>>>> query that uses 'SELECT MAX(xxx)' for all non-key fields.
>>>>
>>>> My first question is: which of the two seems better?
>>>> (The second one is quite a bit harder to maintain, as all the fields
>>>> have to be written explicitly in the query.)
>>>>
>>>> The second question is: what is the best way to do the deduplication
>>>> and import with an incremental approach?
>>>> Something like this?
>>>>
>>>> INSERT INTO TABLE DEDUPLICATED
>>>> SELECT cl.*
>>>> FROM (
>>>>     SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>>>>     FROM LAST_DAILY_FILE d   -- ONLY the last file
>>>> ) cl
>>>> WHERE cl.pos = 1   -- REQUIRED to remove all the duplication inside the last file
>>>>   AND cl.KEY NOT IN (SELECT KEY FROM DEDUPLICATED)   -- remove duplication
>>>>       -- between the last file and all the existing files
>>>>
>>>> And the last question: for the last query, does an index on KEY help
>>>> with Hive, as it would on a classical relational database?
>>>>
>>>> Regards,
>>>> Philippe
>>>>
>>>> --
>>>> Philippe Kernévez
>>>>
>>>> Directeur technique (Suisse),
>>>> pkerne...@octo.com
>>>> +41 79 888 33 32
>>>>
>>>> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
>>>> OCTO Technology http://www.octo.com
>>>>
>>>
>>> --
>>> Philippe Kernévez
>>>
>>> Directeur technique (Suisse),
>>> pkerne...@octo.com
>>> +41 79 888 33 32
>>>
>>> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
>>> OCTO Technology http://www.octo.com
>>
>> --
>> Devopam Mittra
>> Life and Relations are not binary
>
> --
> Philippe Kernévez
>
> Directeur technique (Suisse),
> pkerne...@octo.com
> +41 79 888 33 32
>
> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
> OCTO Technology http://www.octo.com

--
Devopam Mittra
Life and Relations are not binary
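A footnote to the incremental question quoted above: the NOT IN step can also be written as a left outer join used as an anti-join, a pattern that avoids the restrictions Hive places on WHERE-clause subqueries. A minimal sketch, reusing only the names already in the thread (DEDUPLICATED, LAST_DAILY_FILE, KEY); in practice the target columns would be listed explicitly so the helper pos column is not inserted:

INSERT INTO TABLE DEDUPLICATED
SELECT cl.*                          -- list the real columns here, excluding pos
FROM (
    SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
    FROM LAST_DAILY_FILE d           -- only the latest daily file
) cl
LEFT OUTER JOIN DEDUPLICATED old
    ON cl.KEY = old.KEY              -- match keys already loaded
WHERE cl.pos = 1                     -- drop duplicates inside the new file itself
  AND old.KEY IS NULL;               -- keep only keys not yet present in DEDUPLICATED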