Hi,

I have a table (named DEDUPLICATED) that contains about 1 billion rows.
Each day I receive a new file with about 5 million rows. About 10% of
those rows are duplicates (duplicates occur inside a daily file but also
across files).
There are about 30 fields in the files.

For now I deduplicate all the data every day with the following query:

  INSERT OVERWRITE TABLE DEDUPLICATED
    SELECT cl.*
    FROM (
        SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
        FROM DAILY d  -- DAILY is an external table that contains all the daily files
        ) cl
    WHERE cl.pos = 1

On the mailing list I saw another approach based on a GROUP BY KEY query
that uses MAX(xxx) for all the non-key fields.
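If I understand it correctly, that would look roughly like this (FIELD1
and FIELD2 are just placeholder names standing for my ~30 non-key
columns):

  INSERT OVERWRITE TABLE DEDUPLICATED
    SELECT d.KEY,
           MAX(d.FIELD1) AS FIELD1,  -- FIELD1, FIELD2, ... are placeholders
           MAX(d.FIELD2) AS FIELD2   -- for the non-key columns
    FROM DAILY d
    GROUP BY d.KEY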

My first question is: which of the two seems to be better?
(The second one is harder to maintain, as all the fields have to be
written explicitly in the query.)



My second question is: what is the best way to do the deduplication and
import with an incremental approach?
Something like this?
  INSERT INTO TABLE DEDUPLICATED
    SELECT cl.*
    FROM (
        SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
        FROM LAST_DAILY_FILE d  -- ONLY the last file
        ) cl
    WHERE cl.pos = 1  -- required to remove the duplicates inside the last file
      AND cl.KEY NOT IN (SELECT KEY FROM DEDUPLICATED)  -- removes duplicates between the last file and the existing data
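I am also not sure that my Hive version accepts a NOT IN subquery in the
WHERE clause, so the same idea could perhaps be written with a LEFT OUTER
JOIN instead (just a sketch):

  INSERT INTO TABLE DEDUPLICATED
    SELECT cl.*
    FROM (
        SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
        FROM LAST_DAILY_FILE d
        ) cl
    LEFT OUTER JOIN DEDUPLICATED dd ON (cl.KEY = dd.KEY)
    WHERE cl.pos = 1       -- deduplicate inside the last file
      AND dd.KEY IS NULL   -- keep only keys not already in DEDUPLICATED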

And the last question: for that last query, does an index on KEY help in
Hive as it would on a classical relational database?
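To be concrete, I was thinking of something like this (the index name is
just an example, and I picked the COMPACT handler arbitrarily):

  CREATE INDEX deduplicated_key_idx
    ON TABLE DEDUPLICATED (KEY)
    AS 'COMPACT'
    WITH DEFERRED REBUILD;
  ALTER INDEX deduplicated_key_idx ON DEDUPLICATED REBUILD;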

Regards,
Philippe



-- 
Philippe Kernévez



Directeur technique (Suisse),
pkerne...@octo.com
+41 79 888 33 32

Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
OCTO Technology http://www.octo.com
