Hi Dev,

I'm using hive 0.14.

Regards,

On Tue, Feb 17, 2015 at 8:12 PM, Devopam Mittra <devo...@gmail.com> wrote:

> Can you please state the hive version
>
> regards
> Dev
> +91 958 305 9899
>
> On Feb 17, 2015, at 11:01 PM, Philippe Kernévez <pkerne...@octo.com>
> wrote:
>
> Hi,
>
> I have a table (named DEDUPLICATED) that contains about 1 billions rows.
> Each day I receive a new file  with about 5 millions rows. About 10% of
> those rows are duplicated (duplication occur inside a daily file but also
> between files).
> There are about 30 fields in the files.
>
> As for now I deduplicate all the data every day with the following request
> :
>
>   INSERT OVERWRITE TABLE DEDUPLICATED
>     SELECT cl.*
>     FROM (
>         SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>         FROM DAILY d  -- DAILY is an external table that contains all the
> daily files
>         ) cl
>     WHERE cl.pos = 1
>
> On the mailing list I saw another approach base on a "group by KEY"
> request and use a 'select MAX(xxx)' for all non-key fields.
>
> My first question is : which of the both seems to be better ?
> (the second one is quite harder to maintain as all the fields should be
> explicitly written in the request).
>
>
>
> The second question is : what is the best way to do the deduplication and
> import on a incremental approach ?
> Something like that ?
>   INSERT TABLE DEDUPLICATED
>     SELECT cl.*
>     FROM (
>         SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>         FROM LAST_DAILY_FILE d     -- ONLY the last file
>         ) cl
>     WHERE cl.pos = 1    -- REQUIRED to remove all the duplication inside
> the last file
>    AND cl.KEY NOT IN SELECT KEY FROM DEDUPLICATED  -- remove duplication
> between the last file and all the existing files
>
> And the last question : for the last request, does an index on KEY help
> with hive as it can help on a classical relational database ?
>
> Regards,
> Philippe
>
>
>
> --
> Philippe Kernévez
>
>
>
> Directeur technique (Suisse),
> pkerne...@octo.com
> +41 79 888 33 32
>
> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
> OCTO Technology http://www.octo.com
>
>


-- 
Philippe Kernévez



Directeur technique (Suisse),
pkerne...@octo.com
+41 79 888 33 32

Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
OCTO Technology http://www.octo.com

Reply via email to