Can you please state the Hive version?

Regards,
Dev
+91 958 305 9899
> On Feb 17, 2015, at 11:01 PM, Philippe Kernévez <pkerne...@octo.com> wrote:
>
> Hi,
>
> I have a table (named DEDUPLICATED) that contains about 1 billion rows.
> Each day I receive a new file with about 5 million rows. About 10% of those
> rows are duplicated (duplication occurs inside a daily file but also between
> files).
> There are about 30 fields in the files.
>
> For now I deduplicate all the data every day with the following request:
>
> INSERT OVERWRITE TABLE DEDUPLICATED
> SELECT cl.*
> FROM (
>   SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>   FROM DAILY d -- DAILY is an external table that contains all the
>                -- daily files
> ) cl
> WHERE cl.pos = 1
>
> On the mailing list I saw another approach based on a "GROUP BY KEY" request
> that uses SELECT MAX(xxx) for all non-key fields.
>
> My first question is: which of the two seems better?
> (The second one is quite a bit harder to maintain, as all the fields must be
> written explicitly in the request.)
>
>
>
> The second question is: what is the best way to do the deduplication and
> import with an incremental approach?
> Something like this?
>
> INSERT INTO TABLE DEDUPLICATED
> SELECT cl.*
> FROM (
>   SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>   FROM LAST_DAILY_FILE d -- ONLY the last file
> ) cl
> WHERE cl.pos = 1 -- REQUIRED to remove all the duplicates inside the
>                  -- last file
>   AND cl.KEY NOT IN (SELECT KEY FROM DEDUPLICATED) -- remove duplicates
>       -- between the last file and all the existing files
>
> And the last question: for the last request, does an index on KEY help in
> Hive the way it can on a classical relational database?
>
> Regards,
> Philippe
>
>
>
> --
> Philippe Kernévez
>
>
>
> Technical Director (Switzerland),
> pkerne...@octo.com
> +41 79 888 33 32
>
> Find OCTO on OCTO Talk: http://blog.octo.com
> OCTO Technology http://www.octo.com
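For reference, the "GROUP BY KEY plus MAX" alternative mentioned in the quoted mail could be sketched as below. This is only an illustration of the pattern, not Philippe's actual query: the column names FIELD1 ... FIELDn are placeholders for the ~30 real non-key fields, each of which would have to be listed explicitly.

```sql
-- Sketch of the GROUP BY deduplication approach: one output row per KEY,
-- taking MAX() of every non-key column. FIELD1/FIELD2 are illustrative
-- placeholders; all ~30 non-key fields must be enumerated by hand.
INSERT OVERWRITE TABLE DEDUPLICATED
SELECT
  d.KEY,
  MAX(d.FIELD1) AS FIELD1,
  MAX(d.FIELD2) AS FIELD2
  -- ... repeat for each remaining non-key field
FROM DAILY d
GROUP BY d.KEY;
```

On the incremental question: older Hive releases (before 0.13) did not accept `IN`/`NOT IN` subqueries in the WHERE clause, which is one reason the Hive version matters here. Under that assumption, the same filter is commonly expressed as a LEFT OUTER JOIN with a NULL check, sketched below. Note that `cl.*` would also carry the helper `pos` column, so the select list should name the target columns explicitly (again, FIELD1/FIELD2 are placeholders).

```sql
-- Incremental load rewritten without NOT IN, using a LEFT OUTER JOIN
-- against the existing table and keeping only keys with no match.
INSERT INTO TABLE DEDUPLICATED
SELECT
  cl.KEY,
  cl.FIELD1,
  cl.FIELD2
  -- ... list the remaining fields, deliberately omitting cl.pos
FROM (
  SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
  FROM LAST_DAILY_FILE d          -- only the newest daily file
) cl
LEFT OUTER JOIN DEDUPLICATED dd
  ON cl.KEY = dd.KEY
WHERE cl.pos = 1                  -- drop duplicates inside the new file
  AND dd.KEY IS NULL;             -- drop keys already in DEDUPLICATED
```

Both sketches assume KEY alone identifies a logical row; if ties between duplicates should be broken deterministically, the `ROW_NUMBER()` variant also needs an `ORDER BY` inside the `OVER (...)` clause.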