Hi, We have a huge table that may contain duplicate records. A record is considered a duplicate based on four fields (fld1 through fld4). We need to identify the duplicate records and, if possible, mark all of them except the first record in each group of duplicates, where "first" means the one with the earliest created time.
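To make the requirement concrete, here is a small Python sketch of the marking logic we are after. It only illustrates the intended result on a handful of in-memory rows, not how it would run at scale. The names `mark_duplicates`, `created_time`, and `is_duplicate` are made up for this example, and the sample rows are invented.

```python
from collections import defaultdict


def mark_duplicates(records):
    """Return copies of the records with an 'is_duplicate' flag added.

    Records sharing (fld1, fld2, fld3, fld4) form one group. The record
    with the earliest created_time in each group is treated as the
    original; every other record in that group is flagged.
    """
    groups = defaultdict(list)
    for index, record in enumerate(records):
        key = (record["fld1"], record["fld2"], record["fld3"], record["fld4"])
        groups[key].append(index)

    # Build new rows instead of modifying the input, since the real
    # table cannot be updated in place.
    flagged = [dict(record, is_duplicate=False) for record in records]
    for indexes in groups.values():
        # Earliest created_time first; ties fall back to input order.
        ordered = sorted(indexes, key=lambda i: (records[i]["created_time"], i))
        for i in ordered[1:]:
            flagged[i]["is_duplicate"] = True
    return flagged


# Invented sample rows: the first three share the same four key fields.
sample = [
    {"fld1": "a", "fld2": "b", "fld3": "c", "fld4": "d", "created_time": 20},
    {"fld1": "a", "fld2": "b", "fld3": "c", "fld4": "d", "created_time": 10},
    {"fld1": "a", "fld2": "b", "fld3": "c", "fld4": "d", "created_time": 30},
    {"fld1": "x", "fld2": "b", "fld3": "c", "fld4": "d", "created_time": 5},
]

for row in mark_duplicates(sample):
    print(row["created_time"], row["is_duplicate"])
```

In this sample, the row with created time 10 is kept as the original, the rows with created times 20 and 30 are flagged, and the row with a different fld1 is left alone.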
Is this something that can be done in Hive, or do we need to write a custom M/R job for it? Could an inner join or a select with group by be used to find the duplicates? How do I mark the duplicate records, given that there is no update? What's the best way to do this using Hive? Looking forward to hearing your suggestions. Thanks