We've solved similar cases by using a UDAF + UDTF combination, something like:

  select assign(unique(a, b, c, d)) as (a, b, c, d), e, f
  from A
  group by a, b, c, d;

unique (the UDAF) returns a struct, and assign (the UDTF) reassigns names to the struct's fields; you can append an extra column to the struct for marking duplicates. HIVE-2608 may make this shorter.
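If writing the UDAF and UDTF is a barrier, roughly the same marking can be sketched in plain HiveQL by joining the table back to the per-group minimum of the created time. This is only a minimal sketch: it assumes the dedup key is fld1 thru fld4 as in the original question, guesses a created_time column name and reuses the table name A, and it does not break ties when two rows in a group share the earliest created_time.

  -- Flag every row whose created_time is later than the earliest
  -- created_time in its (fld1, fld2, fld3, fld4) group as a duplicate.
  -- Table name A and column name created_time are placeholders.
  select a.fld1, a.fld2, a.fld3, a.fld4, a.created_time,
         case when a.created_time > f.min_time then true else false end as is_duplicate
  from A a
  join (select fld1, fld2, fld3, fld4, min(created_time) as min_time
        from A
        group by fld1, fld2, fld3, fld4) f
    on a.fld1 = f.fld1 and a.fld2 = f.fld2
   and a.fld3 = f.fld3 and a.fld4 = f.fld4;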
2012/8/18 Himanish Kushary <himan...@gmail.com>

> We want to separate them out into different tables for auditing purposes,
> so I was thinking of marking them and then separating them out based on
> the "duplicate" flag.
>
> Is there a better way of splitting them into 2 tables? Also, how would I
> pick up the first record from each group and put it into a different table
> once I have run the cross join on fld1 thru fld4 with order by timestamp?
>
> - Himanish
>
> On Fri, Aug 17, 2012 at 11:54 AM, Bertrand Dechoux <decho...@gmail.com> wrote:
>
>> If I understand correctly:
>> A cross join on fld1 thru fld4 with order by timestamp, taking the first
>> of each group, would allow you to have all the unique and original copies.
>>
>> But indeed, if you don't have something like HBase, you cannot really mark
>> them; you can only select them and write them elsewhere.
>> Why do you specifically want to mark them?
>>
>> Bertrand
>>
>> On Fri, Aug 17, 2012 at 5:46 PM, Bob Gause <bob.ga...@zyquest.com> wrote:
>>
>>> We use com.facebook.hive.udf.UDFNumberRows to do a ranking by time in
>>> some of our queries. You could do that, and then do another select where
>>> the row number/rank is 1 to get all the "unique" rows.
>>>
>>> There are probably a bunch of other ways to do this, but this is the one
>>> that first came to mind for me….
>>>
>>> Enjoy!
>>> Bob
>>>
>>> Robert Gause
>>> Senior Systems Engineer
>>> ZyQuest, Inc.
>>> bob.ga...@zyquest.com
>>>
>>> On Aug 17, 2012, at 9:49 AM, Himanish Kushary wrote:
>>>
>>> > Hi,
>>> >
>>> > We have a huge table which may have duplicate records. A record is
>>> considered a duplicate based on 4 fields (fld1 thru fld4). We need to
>>> identify the duplicate records and possibly mark the duplicates (except
>>> the first record, based on created time, in each group).
>>> >
>>> > Is this something that could be done in Hive, or do we need to write
>>> custom M/R for this? Could an inner join or a select with group by be used
>>> to find the duplicates? How do I mark the duplicate records, given that
>>> there is no update?
>>> >
>>> > What's the best way to do this using Hive? Looking forward to hearing
>>> the suggestions.
>>> >
>>> > Thanks
>>>
>>
>> --
>> Bertrand Dechoux
>
>
> --
> Thanks & Regards
> Himanish
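On the follow-up question about splitting the rows into two tables: Hive's multi-table insert can write both groups in a single pass over the flagged data. A rough sketch only, assuming the output of a marking query like the one above has been stored in a hypothetical table A_marked with an is_duplicate column, and that the two target tables already exist.

  -- Split the flagged rows into two existing tables in one pass.
  -- A_marked, a_originals and a_duplicates are placeholder names.
  from A_marked t
  insert overwrite table a_originals
    select t.fld1, t.fld2, t.fld3, t.fld4, t.created_time
    where t.is_duplicate = false
  insert overwrite table a_duplicates
    select t.fld1, t.fld2, t.fld3, t.fld4, t.created_time
    where t.is_duplicate = true;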