Hi Philippe,

Performance improvement has two factors:
1. availability (read: abundance) of resources
2. need for speed

All advice usually addresses mainly these two factors, as far as I have seen so far.
Since you are doing a full scan each day to identify duplicates, I suggested keeping the data uncompressed (in fact you may do the same with a separate table holding only the key columns).
Apologies that the post didn't come out clearly in terms of scalability; you are definitely the best judge of the fit-gap analysis, being closest to the issues. Dynamic partitions allow a selective target on the table rather than a full table scan, so I presumed that scaling up should never be an issue ...
Couldn't be of much help here I guess :(

regards
Dev
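For illustration, a minimal sketch of the dynamic-partition overwrite described in the earlier message quoted below. Only DAILY and the KEY column come from the thread; DEDUPLICATED_PART, load_month, data_key and col1/col2 are hypothetical stand-ins, and the one-partition-per-key layout is exactly what drives the partition and file-count question raised further down.

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- one partition per key means the default dynamic-partition limits must be raised
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;

-- Hypothetical partitioned target: a low-skew column plus the dedup key.
-- Only two of the ~30 fields are shown.
CREATE TABLE IF NOT EXISTS DEDUPLICATED_PART (
  col1 STRING,
  col2 STRING
)
PARTITIONED BY (load_month STRING, data_key STRING);

-- Only the partitions present in the daily delta are overwritten;
-- the dynamic partition columns must come last in the SELECT list.
INSERT OVERWRITE TABLE DEDUPLICATED_PART PARTITION (load_month, data_key)
SELECT cl.col1, cl.col2, cl.load_month, cl.KEY
FROM (
    SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
    FROM DAILY d
) cl
WHERE cl.pos = 1;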
On Wed, Feb 18, 2015 at 3:02 PM, Philippe Kernévez <pkerne...@octo.com> wrote:

> Hi Dev,
>
> "Only suggestion is that please don't compress the files that are
> linked in the external table".
> Why do you suggest that? Presently, I divide my import time by 2 by using
> 'gz' files instead of the uncompressed 'csv'.
>
> "You may want to leverage dynamic partition"
> I read your post, and I wonder if it can scale. It seems that you have
> hourly and daily aggregation, more or less 24*365 = 8'760 rows per year.
> We have about 2 billion (2E9) rows per year.
> Do you know if your approach can scale to such a factor?
> I also wonder about the impact of the partition columns on the map reduce
> job; they will probably produce a large number of files to build those
> partitions.
>
> Regards,
> Philippe
>
> On Wed, Feb 18, 2015 at 2:49 AM, Devopam Mittra <devo...@gmail.com> wrote:
>
>> dear Philippe,
>> I asked specifically about the version so as to understand which built-in
>> features you can leverage to your benefit.
>>
>> There are two ways that I can think of to help you in your case:
>> 1. Use RANK as you have already noted in your original note - this will
>> be faster and more appropriate. Only suggestion is that please don't
>> compress the files that are linked in the external table.
>>
>> 2. You may want to leverage the dynamic partition feature to manage the
>> duplicate records. I like it and use it extensively nowadays, from 0.13
>> onward. Brief concept: dynamically partition the table on one low-skew
>> column + the 'key' column, then simply INSERT OVERWRITE the 'delta' and it
>> will seamlessly overwrite only the affected rows and redistribute the data
>> in the table internally, without you having to bother about the code or
>> carry that burden yourself. You may refer to this blog I wrote quite some
>> time back: http://linkd.in/1Fq3wdb
>> This technique will cost you a little overhead time with the MR job
>> getting kicked off and all, but it is painless and seamless, so I like it
>> this way.
>>
>> regards
>> Dev
>>
>> On Wed, Feb 18, 2015 at 3:45 AM, Philippe Kernévez <pkerne...@octo.com>
>> wrote:
>>
>>> Hi Dev,
>>>
>>> I'm using Hive 0.14.
>>>
>>> Regards,
>>>
>>> On Tue, Feb 17, 2015 at 8:12 PM, Devopam Mittra <devo...@gmail.com>
>>> wrote:
>>>
>>>> Can you please state the Hive version?
>>>>
>>>> regards
>>>> Dev
>>>> +91 958 305 9899
>>>>
>>>> On Feb 17, 2015, at 11:01 PM, Philippe Kernévez <pkerne...@octo.com>
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I have a table (named DEDUPLICATED) that contains about 1 billion rows.
>>>> Each day I receive a new file with about 5 million rows. About 10% of
>>>> those rows are duplicated (duplication occurs inside a daily file but
>>>> also between files).
>>>> There are about 30 fields in the files.
>>>>
>>>> For now I deduplicate all the data every day with the following query:
>>>>
>>>> INSERT OVERWRITE TABLE DEDUPLICATED
>>>> SELECT cl.*
>>>> FROM (
>>>>     SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>>>>     FROM DAILY d   -- DAILY is an external table that contains all the daily files
>>>> ) cl
>>>> WHERE cl.pos = 1
>>>>
>>>> On the mailing list I saw another approach, based on a "GROUP BY KEY"
>>>> query that uses 'SELECT MAX(xxx)' for all non-key fields.
>>>>
>>>> My first question is: which of the two seems better?
>>>> (The second one is quite a bit harder to maintain, as all the fields
>>>> have to be written explicitly in the query.)
>>>>
>>>> The second question is: what is the best way to do the deduplication
>>>> and import with an incremental approach?
>>>> Something like this?
>>>>
>>>> INSERT INTO TABLE DEDUPLICATED
>>>> SELECT cl.*
>>>> FROM (
>>>>     SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>>>>     FROM LAST_DAILY_FILE d   -- ONLY the last file
>>>> ) cl
>>>> WHERE cl.pos = 1   -- REQUIRED to remove all the duplication inside the last file
>>>>   AND cl.KEY NOT IN (SELECT KEY FROM DEDUPLICATED)   -- remove duplication
>>>>       -- between the last file and all the existing files
>>>>
>>>> And the last question: for the last query, does an index on KEY help
>>>> with Hive, as it would on a classical relational database?
>>>>
>>>> Regards,
>>>> Philippe
>>>>
>>>> --
>>>> Philippe Kernévez
>>>>
>>>> Directeur technique (Suisse),
>>>> pkerne...@octo.com
>>>> +41 79 888 33 32
>>>>
>>>> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
>>>> OCTO Technology http://www.octo.com
>>>>
>>>
>>> --
>>> Philippe Kernévez
>>>
>>> Directeur technique (Suisse),
>>> pkerne...@octo.com
>>> +41 79 888 33 32
>>>
>>> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
>>> OCTO Technology http://www.octo.com
>>
>> --
>> Devopam Mittra
>> Life and Relations are not binary
>
> --
> Philippe Kernévez
>
> Directeur technique (Suisse),
> pkerne...@octo.com
> +41 79 888 33 32
>
> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
> OCTO Technology http://www.octo.com

--
Devopam Mittra
Life and Relations are not binary
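A footnote to the incremental question quoted above: the NOT IN step can also be written as a left outer join used as an anti-join, a pattern that avoids the restrictions Hive places on WHERE-clause subqueries. A minimal sketch, reusing only the names already in the thread (DEDUPLICATED, LAST_DAILY_FILE, KEY); in practice the target columns would be listed explicitly so the helper pos column is not inserted:

INSERT INTO TABLE DEDUPLICATED
SELECT cl.*                          -- list the real columns here, excluding pos
FROM (
    SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
    FROM LAST_DAILY_FILE d           -- only the latest daily file
) cl
LEFT OUTER JOIN DEDUPLICATED old
    ON cl.KEY = old.KEY              -- match keys already loaded
WHERE cl.pos = 1                     -- drop duplicates inside the new file itself
  AND old.KEY IS NULL;               -- keep only keys not yet present in DEDUPLICATED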