Thanks everyone for the different possible solutions.We resolved this by using the rank function and following the instructions at https://github.com/edwardcapriolo/hive-rank
This is similar to Bob's approach. Thanks On Mon, Aug 20, 2012 at 3:04 AM, Navis류승우 <navis....@nexr.com> wrote: > We've solved similar cases by using UDAF + UDTF, something like.. > > select assign(unique(a, b, c, d)) as (a, b, c, d), e, f from A group by a, > b, c, d; > > unique(UDAF) returns struct and assign(UDTF) reassigns names for the > struct, where you can append a column for marking. > > HIVE-2608 may make this shorter. > > > 2012/8/18 Himanish Kushary <himan...@gmail.com> > >> We want to separate them out into different tables for auditing >> purposes, so i was thinking of marking them and then >> separating them out based on the "duplicate" flag. >> >> Is there a better way of splitting them into 2 tables ? Also how would I >> pick up the first from the group and put into a different table >> once I have run the cross join on fld1 thru fld4 with order by timestamp? >> >> - Himanish >> >> On Fri, Aug 17, 2012 at 11:54 AM, Bertrand Dechoux <decho...@gmail.com>wrote: >> >>> If I understand correctly : >>> A cross join on fld1 thru fld4 with order by timestamp and take the >>> first of each group. >>> would allow you to have all the unique and original copies. >>> >>> But indeed if you don't have something like HBase, you can not mark them >>> really but you can select them and write them elsewhere. >>> Why do you specifically want to mark them? >>> >>> Bertrand >>> >>> >>> On Fri, Aug 17, 2012 at 5:46 PM, Bob Gause <bob.ga...@zyquest.com>wrote: >>> >>>> We use com.facebook.hive.udf.UDFNumberRows to do a ranking by time in >>>> some of our queries. You could do that, and then do another select where >>>> the row number/rank is 1 to get all the "unique" rows. >>>> >>>> There are probably a bunch of other ways to do this, but this is the >>>> one that first came to mind for me…. >>>> >>>> Enjoy! >>>> Bob >>>> >>>> Robert Gause >>>> Senior Systems Engineer >>>> ZyQuest, Inc. >>>> bob.ga...@zyquest.com >>>> >>>> On Aug 17, 2012, at 9:49 AM, Himanish Kushary wrote: >>>> >>>> > Hi, >>>> > >>>> > We have a huge table which may have duplicate records.A record is >>>> considered duplicate based on 4 fields ( fld1 thru fld4) . We need to >>>> identify the duplicate records and possibly mark the duplicates(except the >>>> first record based on created time for a record). >>>> > >>>> > Is this something that could be done by hive or we need to write >>>> custom M/R for this.Could a inner join or a select with group by be used to >>>> find the duplicates ? How do I mark the duplicate records as there is no >>>> update. >>>> > >>>> > Whats the best way to do this using Hive ? Looking forward to hear >>>> the suggestions. >>>> > >>>> > Thanks >>>> >>>> >>> >>> >>> -- >>> Bertrand Dechoux >>> >> >> >> >> -- >> Thanks & Regards >> Himanish >> > > -- Thanks & Regards Himanish