Hi Dev,

> "Only suggestion is that please don't compress the files that are linked in the external table."

Why do you suggest that? At present I halve my import time by using 'gz' files instead of the uncompressed 'csv'.
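For context, the external table is declared roughly as follows (the column list and the location are simplified here; only KEY is a real field name):

CREATE EXTERNAL TABLE DAILY (
  KEY STRING
  -- ... plus the ~30 other fields present in the daily files ...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/daily/';  -- illustrative path

Is your concern that a '.gz' file is not splittable, so each daily file ends up being read by a single mapper?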
> "You may want to leverage dynamic partition" I read you post, and I wonder if it can scale. It seems that you have hourly and daily aggregation. More or less 24*365=8'800 raw per year. We have about 2 billions (2E9) raws per year. Do you know if you approach can scale on a such factor ? I also wonder about the impact of the partition columns on the map reduce job, they will probably produce a big number of files to produce those partitions. Regards, Philippe On Wed, Feb 18, 2015 at 2:49 AM, Devopam Mittra <devo...@gmail.com> wrote: > dear Philippe, > I asked specifically the version so as to understand what built-in > features you can leverage to your benefit. > > There are two ways that I can think of - to help you in your case : > 1. Use RANK as you have already noted in your original note - this will be > faster and more appropriate . Only suggestion is that please don't compress > the files that are linked in the external table. > > 2. You may want to leverage dynamic partition feature to your aid for > managing the duplicate records . I like it and use it extensively now a > days after 0.13 onward. Brief concept : dynamically partition the table on > one low skew column + 'key' column . Then simply INSERT OVERWRITE the > 'delta' and it will seamlessly overwrite only the affected rows and do the > redistribution of data in the table internally without you having to bother > about the code or the burden to do it. You may refer to this blog I wrote > quite some time back : http://linkd.in/1Fq3wdb > This technique will cost you a little overhead time with the MR job > getting kicked off and all , but is painless and seamless , so I like it > this way. > > regards > Dev > > On Wed, Feb 18, 2015 at 3:45 AM, Philippe Kernévez <pkerne...@octo.com> > wrote: > >> Hi Dev, >> >> I'm using hive 0.14. >> >> Regards, >> >> On Tue, Feb 17, 2015 at 8:12 PM, Devopam Mittra <devo...@gmail.com> >> wrote: >> >>> Can you please state the hive version >>> >>> regards >>> Dev >>> +91 958 305 9899 >>> >>> On Feb 17, 2015, at 11:01 PM, Philippe Kernévez <pkerne...@octo.com> >>> wrote: >>> >>> Hi, >>> >>> I have a table (named DEDUPLICATED) that contains about 1 billions rows. >>> Each day I receive a new file with about 5 millions rows. About 10% of >>> those rows are duplicated (duplication occur inside a daily file but also >>> between files). >>> There are about 30 fields in the files. >>> >>> As for now I deduplicate all the data every day with the following >>> request : >>> >>> INSERT OVERWRITE TABLE DEDUPLICATED >>> SELECT cl.* >>> FROM ( >>> SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos >>> FROM DAILY d -- DAILY is an external table that contains all >>> the daily files >>> ) cl >>> WHERE cl.pos = 1 >>> >>> On the mailing list I saw another approach base on a "group by KEY" >>> request and use a 'select MAX(xxx)' for all non-key fields. >>> >>> My first question is : which of the both seems to be better ? >>> (the second one is quite harder to maintain as all the fields should be >>> explicitly written in the request). >>> >>> >>> >>> The second question is : what is the best way to do the deduplication >>> and import on a incremental approach ? >>> Something like that ? 
>>> The second question is: what is the best way to do the deduplication
>>> and the import with an incremental approach?
>>> Something like this?
>>>
>>> INSERT INTO TABLE DEDUPLICATED
>>> SELECT cl.*
>>> FROM (
>>>   SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>>>   FROM LAST_DAILY_FILE d  -- ONLY the last file
>>> ) cl
>>> WHERE cl.pos = 1  -- required to remove all the duplication inside the last file
>>>   AND cl.KEY NOT IN (SELECT KEY FROM DEDUPLICATED)  -- remove duplication between the last file and all the existing files
>>>
>>> And the last question: for that last query, does an index on KEY help
>>> with Hive, as it would on a classical relational database?
>>>
>>> Regards,
>>> Philippe
>>>
>>> --
>>> Philippe Kernévez
>>>
>>> Directeur technique (Suisse),
>>> pkerne...@octo.com
>>> +41 79 888 33 32
>>>
>>> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
>>> OCTO Technology http://www.octo.com
>>
>>
>> --
>> Philippe Kernévez
>>
>> Directeur technique (Suisse),
>> pkerne...@octo.com
>> +41 79 888 33 32
>>
>> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
>> OCTO Technology http://www.octo.com
>
>
> --
> Devopam Mittra
> Life and Relations are not binary

--
Philippe Kernévez

Directeur technique (Suisse),
pkerne...@octo.com
+41 79 888 33 32

Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
OCTO Technology http://www.octo.com
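P.S. In case the NOT IN subquery turns out to be a problem, I guess the same filter could also be written as a left outer join. A rough, untested sketch:

INSERT INTO TABLE DEDUPLICATED
SELECT cl.*  -- in practice the column list must exclude the helper column 'pos'
FROM (
  SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
  FROM LAST_DAILY_FILE d
) cl
LEFT OUTER JOIN DEDUPLICATED t ON cl.KEY = t.KEY
WHERE cl.pos = 1
  AND t.KEY IS NULL  -- keep only keys not already present in DEDUPLICATED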