Can you please state the Hive version?

Regards,
Dev
+91 958 305 9899
> On Feb 17, 2015, at 11:01 PM, Philippe Kernévez <pkerne...@octo.com> wrote:
>
> Hi,
>
> I have a table (named DEDUPLICATED) that contains about 1 billion rows.
> Each day I receive a new file with about 5 million rows. About 10% of those
> rows are duplicated (duplication occurs inside a daily file but also between
> files).
> There are about 30 fields in the files.
>
> For now I deduplicate all the data every day with the following request:
>
> INSERT OVERWRITE TABLE DEDUPLICATED
> SELECT cl.*
> FROM (
>   SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>   FROM DAILY d -- DAILY is an external table that contains all the
>                -- daily files
> ) cl
> WHERE cl.pos = 1
>
> On the mailing list I saw another approach based on a "GROUP BY KEY" request
> that uses SELECT MAX(xxx) for all non-key fields.
>
> My first question is: which of the two seems better?
> (The second one is quite a bit harder to maintain, as all the fields must be
> written explicitly in the request.)
>
>
>
> The second question is: what is the best way to do the deduplication and
> import with an incremental approach?
> Something like this?
>
> INSERT INTO TABLE DEDUPLICATED
> SELECT cl.*
> FROM (
>   SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>   FROM LAST_DAILY_FILE d -- ONLY the last file
> ) cl
> WHERE cl.pos = 1 -- REQUIRED to remove all the duplicates inside the
>                  -- last file
>   AND cl.KEY NOT IN (SELECT KEY FROM DEDUPLICATED) -- remove duplicates
>       -- between the last file and all the existing files
>
> And the last question: for the last request, does an index on KEY help in
> Hive the way it can on a classical relational database?
>
> Regards,
> Philippe
>
>
>
> --
> Philippe Kernévez
>
>
>
> Technical Director (Switzerland),
> pkerne...@octo.com
> +41 79 888 33 32
>
> Find OCTO on OCTO Talk: http://blog.octo.com
> OCTO Technology http://www.octo.com
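For reference, the "GROUP BY KEY plus MAX" alternative mentioned in the quoted mail could be sketched as below. This is only an illustration of the pattern, not Philippe's actual query: the column names FIELD1 ... FIELDn are placeholders for the ~30 real non-key fields, each of which would have to be listed explicitly.

```sql
-- Sketch of the GROUP BY deduplication approach: one output row per KEY,
-- taking MAX() of every non-key column. FIELD1/FIELD2 are illustrative
-- placeholders; all ~30 non-key fields must be enumerated by hand.
INSERT OVERWRITE TABLE DEDUPLICATED
SELECT
  d.KEY,
  MAX(d.FIELD1) AS FIELD1,
  MAX(d.FIELD2) AS FIELD2
  -- ... repeat for each remaining non-key field
FROM DAILY d
GROUP BY d.KEY;
```

On the incremental question: older Hive releases (before 0.13) did not accept `IN`/`NOT IN` subqueries in the WHERE clause, which is one reason the Hive version matters here. Under that assumption, the same filter is commonly expressed as a LEFT OUTER JOIN with a NULL check, sketched below. Note that `cl.*` would also carry the helper `pos` column, so the select list should name the target columns explicitly (again, FIELD1/FIELD2 are placeholders).

```sql
-- Incremental load rewritten without NOT IN, using a LEFT OUTER JOIN
-- against the existing table and keeping only keys with no match.
INSERT INTO TABLE DEDUPLICATED
SELECT
  cl.KEY,
  cl.FIELD1,
  cl.FIELD2
  -- ... list the remaining fields, deliberately omitting cl.pos
FROM (
  SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
  FROM LAST_DAILY_FILE d          -- only the newest daily file
) cl
LEFT OUTER JOIN DEDUPLICATED dd
  ON cl.KEY = dd.KEY
WHERE cl.pos = 1                  -- drop duplicates inside the new file
  AND dd.KEY IS NULL;             -- drop keys already in DEDUPLICATED
```

Both sketches assume KEY alone identifies a logical row; if ties between duplicates should be broken deterministically, the `ROW_NUMBER()` variant also needs an `ORDER BY` inside the `OVER (...)` clause.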