Hi Dev, and thank you for your help.

On Wed, Feb 18, 2015 at 11:31 AM, Devopam Mittra <devo...@gmail.com> wrote:
> hi Philippe,
> Performance improvement has two factors: 1. availability (read: abundance)
> of resources, 2. need for speed.
> All advice usually addresses mainly these two factors, as far as I have
> seen so far.
>
> Since you are doing a full scan each day to identify duplicates, I suggested
> keeping the data uncompressed (in fact you may do the same with a separate
> table holding only the key columns).

Is there a difference - in terms of performance - between another table with
only the key columns and an index?

> Apologies that the post didn't come out clear in terms of scalability; you
> are definitely the best judge of the fit-gap analysis, being closest to the
> issues. Dynamic partitions allow a selective target on the table rather than
> a full table scan, so I presumed that scaling up should never be an issue ...
>
> Couldn't be of much help here I guess :(

Thanks a lot for your contribution.
Philippe

> regards
> Dev
>
> On Wed, Feb 18, 2015 at 3:02 PM, Philippe Kernévez <pkerne...@octo.com> wrote:
>
>> Hi Dev,
>>
>> > "Only suggestion is that please don't compress the files that are
>> > linked in the external table."
>> Why do you suggest that? At present, I halve my import time by using
>> 'gz' files instead of the uncompressed 'csv'.
>>
>> > "You may want to leverage dynamic partition"
>> I read your post, and I wonder if it can scale. It seems that you have
>> hourly and daily aggregations: more or less 24*365 = 8'760 rows per year.
>> We have about 2 billion (2E9) rows per year.
>> Do you know if your approach can scale to such a volume?
>> I also wonder about the impact of the partition columns on the map-reduce
>> job: they will probably produce a large number of files to build those
>> partitions.
>>
>> Regards,
>> Philippe
>>
>> On Wed, Feb 18, 2015 at 2:49 AM, Devopam Mittra <devo...@gmail.com> wrote:
>>
>>> dear Philippe,
>>> I asked specifically for the version so as to understand what built-in
>>> features you can leverage to your benefit.
>>>
>>> There are two ways that I can think of to help in your case:
>>> 1. Use RANK as you have already noted in your original note - this will
>>> be faster and more appropriate. Only suggestion is that please don't
>>> compress the files that are linked in the external table.
>>>
>>> 2. You may want to leverage the dynamic partition feature to manage the
>>> duplicate records. I like it and use it extensively nowadays, from 0.13
>>> onward. Brief concept: dynamically partition the table on one low-skew
>>> column + the 'key' column. Then simply INSERT OVERWRITE the 'delta' and
>>> it will seamlessly overwrite only the affected partitions and redistribute
>>> the data in the table internally, without you having to bother about the
>>> code or the burden of doing it yourself. You may refer to this blog I
>>> wrote quite some time back: http://linkd.in/1Fq3wdb
>>> This technique will cost you a little overhead time with the MR job
>>> getting kicked off and all, but it is painless and seamless, so I like
>>> it this way.
>>>
>>> regards
>>> Dev
>>>
>>> On Wed, Feb 18, 2015 at 3:45 AM, Philippe Kernévez <pkerne...@octo.com> wrote:
>>>
>>>> Hi Dev,
>>>>
>>>> I'm using Hive 0.14.
>>>>
>>>> Regards,
>>>>
>>>> On Tue, Feb 17, 2015 at 8:12 PM, Devopam Mittra <devo...@gmail.com> wrote:
>>>>
>>>>> Can you please state the hive version
>>>>>
>>>>> regards
>>>>> Dev
>>>>> +91 958 305 9899
>>>>>
>>>>> On Feb 17, 2015, at 11:01 PM, Philippe Kernévez <pkerne...@octo.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have a table (named DEDUPLICATED) that contains about 1 billion rows.
>>>>> Each day I receive a new file with about 5 million rows.
>>>>> About 10% of those rows are duplicated (duplication occurs inside a
>>>>> daily file but also between files).
>>>>> There are about 30 fields in the files.
>>>>>
>>>>> For now I deduplicate all the data every day with the following query:
>>>>>
>>>>> INSERT OVERWRITE TABLE DEDUPLICATED
>>>>> SELECT cl.*
>>>>> FROM (
>>>>>     SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>>>>>     FROM DAILY d  -- DAILY is an external table that contains all the daily files
>>>>> ) cl
>>>>> WHERE cl.pos = 1
>>>>>
>>>>> On the mailing list I saw another approach based on a "GROUP BY KEY"
>>>>> query that uses SELECT MAX(xxx) for all the non-key fields.
>>>>>
>>>>> My first question is: which of the two seems better?
>>>>> (The second one is rather harder to maintain, as all the fields have
>>>>> to be written explicitly in the query.)
>>>>>
>>>>> The second question is: what is the best way to do the deduplication
>>>>> and import with an incremental approach?
>>>>> Something like this?
>>>>>
>>>>> INSERT INTO TABLE DEDUPLICATED
>>>>> SELECT cl.*
>>>>> FROM (
>>>>>     SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>>>>>     FROM LAST_DAILY_FILE d  -- ONLY the last file
>>>>> ) cl
>>>>> WHERE cl.pos = 1  -- required to remove duplication inside the last file
>>>>>   AND cl.KEY NOT IN (SELECT KEY FROM DEDUPLICATED)  -- remove duplication
>>>>>       -- between the last file and all the existing files
>>>>>
>>>>> And the last question: for the last query, does an index on KEY help
>>>>> in Hive, as it would in a classical relational database?
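[Editor's note: a side remark on the incremental query above. Hive's support for subqueries in the WHERE clause is limited, and NOT IN also misbehaves when the subquery can return NULL keys, so the same anti-join is commonly written with a LEFT JOIN. The following is only a sketch reusing the table and column names from the thread; the explicit field list is a placeholder, since `cl.*` would drag the extra `pos` column into the target table.]

```sql
-- Sketch: incremental load rewritten as a left anti-join,
-- avoiding the NOT IN subquery.
INSERT INTO TABLE DEDUPLICATED
SELECT cl.field1, cl.field2  -- ... list the ~30 fields explicitly, omitting pos
FROM (
    SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
    FROM LAST_DAILY_FILE d            -- only the newest file
) cl
LEFT JOIN DEDUPLICATED dd ON cl.KEY = dd.KEY
WHERE cl.pos = 1                      -- drop duplicates inside the file
  AND dd.KEY IS NULL;                 -- keep only keys not already present
```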
>>>>>
>>>>> Regards,
>>>>> Philippe
>>>
>>> --
>>> Devopam Mittra
>>> Life and Relations are not binary

-- 
Philippe Kernévez

Directeur technique (Suisse),
pkerne...@octo.com
+41 79 888 33 32

Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
OCTO Technology http://www.octo.com
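[Editor's note: for reference, Dev's dynamic-partition suggestion in the thread might look roughly like the sketch below. All table and column names here are invented for illustration (`DEDUPLICATED_P`, `low_skew_col`), and partitioning on a high-cardinality key column is exactly the scalability concern Philippe raises, so treat this as an outline of the idea, not a recommended implementation.]

```sql
-- Sketch of the dynamic-partition dedup idea (names are assumptions).
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Target table partitioned on one low-skew column plus the key column.
CREATE TABLE DEDUPLICATED_P (
  field1 STRING,
  field2 STRING
  -- ... remaining non-key fields
)
PARTITIONED BY (low_skew_col STRING, key STRING);

-- INSERT OVERWRITE with the daily delta replaces only the partitions
-- that appear in the SELECT; untouched partitions are left as-is.
-- Partition columns must come last in the SELECT list.
INSERT OVERWRITE TABLE DEDUPLICATED_P PARTITION (low_skew_col, key)
SELECT field1, field2, low_skew_col, key
FROM LAST_DAILY_FILE;
```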