If I understand correctly : A cross join on fld1 thru fld4 with order by timestamp and take the first of each group. would allow you to have all the unique and original copies.
But indeed if you don't have something like HBase, you can not mark them really but you can select them and write them elsewhere. Why do you specifically want to mark them? Bertrand On Fri, Aug 17, 2012 at 5:46 PM, Bob Gause <bob.ga...@zyquest.com> wrote: > We use com.facebook.hive.udf.UDFNumberRows to do a ranking by time in some > of our queries. You could do that, and then do another select where the row > number/rank is 1 to get all the "unique" rows. > > There are probably a bunch of other ways to do this, but this is the one > that first came to mind for me…. > > Enjoy! > Bob > > Robert Gause > Senior Systems Engineer > ZyQuest, Inc. > bob.ga...@zyquest.com > > On Aug 17, 2012, at 9:49 AM, Himanish Kushary wrote: > > > Hi, > > > > We have a huge table which may have duplicate records.A record is > considered duplicate based on 4 fields ( fld1 thru fld4) . We need to > identify the duplicate records and possibly mark the duplicates(except the > first record based on created time for a record). > > > > Is this something that could be done by hive or we need to write custom > M/R for this.Could a inner join or a select with group by be used to find > the duplicates ? How do I mark the duplicate records as there is no update. > > > > Whats the best way to do this using Hive ? Looking forward to hear the > suggestions. > > > > Thanks > > -- Bertrand Dechoux