Hi,

I have a table (named DEDUPLICATED) that contains about 1 billion rows. Each day I receive a new file with about 5 million rows. About 10% of those rows are duplicates (duplication occurs both inside a daily file and between files). There are about 30 fields in the files.
For now I deduplicate all the data every day with the following query:

INSERT OVERWRITE TABLE DEDUPLICATED
SELECT cl.*
FROM (
  SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
  FROM DAILY d  -- DAILY is an external table that contains all the daily files
) cl
WHERE cl.pos = 1;

On the mailing list I saw another approach, based on a "GROUP BY KEY" query that uses "SELECT MAX(xxx)" for every non-key field.

My first question is: which of the two seems better? (The second one is harder to maintain, since all the fields have to be written out explicitly in the query.)

My second question is: what is the best way to do the deduplication and import with an incremental approach? Something like this?

INSERT INTO TABLE DEDUPLICATED
SELECT cl.*
FROM (
  SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
  FROM LAST_DAILY_FILE d  -- ONLY the last file
) cl
WHERE cl.pos = 1  -- required to remove the duplicates inside the last file
  AND cl.KEY NOT IN (SELECT KEY FROM DEDUPLICATED);  -- removes the duplicates between the last file and the existing data

And the last question: for that last query, does an index on KEY help in Hive the way it can in a classical relational database?

Regards,
Philippe

--
Philippe Kernévez
Technical Director (Switzerland), pkerne...@octo.com
+41 79 888 33 32

Find OCTO on OCTO Talk: http://blog.octo.com
OCTO Technology http://www.octo.com
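
PS: To make the first question more concrete, here is roughly what I understand the GROUP BY variant to look like. This is only a sketch: FIELD_1 ... FIELD_30 are placeholders for the real column names, each of which would have to be listed with its own MAX().

INSERT OVERWRITE TABLE DEDUPLICATED
SELECT
  d.KEY,
  MAX(d.FIELD_1) AS FIELD_1,
  MAX(d.FIELD_2) AS FIELD_2,
  -- ... one MAX(...) per remaining field ...
  MAX(d.FIELD_30) AS FIELD_30
FROM DAILY d
GROUP BY d.KEY;

One thing I am not sure about: if two "duplicate" rows ever differ in some field, the MAX() version could mix values coming from different rows, whereas the ROW_NUMBER() version keeps one whole row.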
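
PPS: In case the NOT IN (subquery) form turns out to be unsupported or slow on our Hive version, I could also express the incremental load as a LEFT OUTER JOIN anti-join. Again, only a sketch of what I have in mind:

INSERT INTO TABLE DEDUPLICATED
SELECT cl.*
FROM (
  SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
  FROM LAST_DAILY_FILE d  -- ONLY the last file
) cl
LEFT OUTER JOIN DEDUPLICATED dd ON (cl.KEY = dd.KEY)
WHERE cl.pos = 1      -- removes the duplicates inside the last file
  AND dd.KEY IS NULL; -- keeps only keys that are not already in DEDUPLICATED

The anti-join simply replaces the NOT IN subquery with a plain equi-join on KEY, so the index question applies to it as well.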