We want to separate them out into different tables for auditing purposes, so I was thinking of marking them and then splitting them based on a "duplicate" flag.

Is there a better way of splitting them into two tables? Also, how would I pick the first record from each group and put it into a different table once I have run the cross join on fld1 thru fld4 with order by timestamp?
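Here is a rough sketch of the kind of multi-insert I was picturing. The names are all made up (src is the big table with fld1 thru fld4 plus a created_ts column; uniq and dups are pre-created target tables), and ROW_NUMBER() assumes a Hive build with windowing support; on our version a ranking UDF like the com.facebook.hive.udf.UDFNumberRows that Bob mentions below would have to stand in for it:

  -- Rank each record within its (fld1..fld4) group by created time,
  -- then fan out into the two audit tables in a single pass.
  FROM (
    SELECT fld1, fld2, fld3, fld4, created_ts,
           ROW_NUMBER() OVER (PARTITION BY fld1, fld2, fld3, fld4
                              ORDER BY created_ts) AS rn
    FROM src
  ) ranked
  INSERT OVERWRITE TABLE uniq
    SELECT fld1, fld2, fld3, fld4, created_ts WHERE rn = 1   -- first copy per group
  INSERT OVERWRITE TABLE dups
    SELECT fld1, fld2, fld3, fld4, created_ts WHERE rn > 1;  -- everything else

That would give us the audit split without needing any in-place update.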
- Himanish

On Fri, Aug 17, 2012 at 11:54 AM, Bertrand Dechoux <decho...@gmail.com> wrote:

> If I understand correctly:
> a cross join on fld1 thru fld4 with order by timestamp, taking the first
> of each group, would allow you to have all the unique and original copies.
>
> But indeed, if you don't have something like HBase, you cannot really mark
> them, but you can select them and write them elsewhere.
> Why do you specifically want to mark them?
>
> Bertrand
>
>
> On Fri, Aug 17, 2012 at 5:46 PM, Bob Gause <bob.ga...@zyquest.com> wrote:
>
>> We use com.facebook.hive.udf.UDFNumberRows to do a ranking by time in
>> some of our queries. You could do that, and then do another select where
>> the row number/rank is 1 to get all the "unique" rows.
>>
>> There are probably a bunch of other ways to do this, but this is the one
>> that first came to mind for me….
>>
>> Enjoy!
>> Bob
>>
>> Robert Gause
>> Senior Systems Engineer
>> ZyQuest, Inc.
>> bob.ga...@zyquest.com
>>
>> On Aug 17, 2012, at 9:49 AM, Himanish Kushary wrote:
>>
>> > Hi,
>> >
>> > We have a huge table which may have duplicate records. A record is
>> considered a duplicate based on 4 fields (fld1 thru fld4). We need to
>> identify the duplicate records and possibly mark the duplicates (except
>> the first record in a group, based on created time).
>> >
>> > Is this something that could be done in Hive, or do we need to write a
>> custom M/R job for this? Could an inner join or a select with group by be
>> used to find the duplicates? How do I mark the duplicate records, as there
>> is no update?
>> >
>> > What's the best way to do this using Hive? Looking forward to hearing
>> the suggestions.
>> >
>> > Thanks
>>
>
>
> --
> Bertrand Dechoux

--
Thanks & Regards
Himanish
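P.S. To size the problem first, the group-by approach from my original mail should at least surface which keys are duplicated; a quick sketch, using the same made-up names as above:

  -- Count copies per candidate key; any group with more than one row
  -- holds duplicates.
  SELECT fld1, fld2, fld3, fld4, COUNT(*) AS copies
  FROM src
  GROUP BY fld1, fld2, fld3, fld4
  HAVING COUNT(*) > 1;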