Hi,

We have a huge table which may contain duplicate records. A record is
considered a duplicate based on 4 fields (fld1 through fld4). We need to
identify the duplicate records and possibly mark them (all except the
first record in each group, based on the record's created time).

Is this something that can be done in Hive, or do we need to write a custom
M/R job for it? Could an inner join or a select with group by be used to
find the duplicates? And how do I mark the duplicate records, given that
there is no update?
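
To make the group-by idea concrete, here is a minimal sketch of what such a
query might look like. It runs against an in-memory SQLite database from
Python purely so that it can be executed anywhere; it is not Hive, and I
have not verified it on Hive. The table name (records), the id and created
columns, the is_duplicate flag and the records_marked output table are all
made-up names for illustration. Since there is no update, the sketch writes
the flagged rows into a new table instead of modifying the original.

```python
import sqlite3


def mark_duplicates(rows):
    """Flag every row except the earliest one in each (fld1..fld4) group.

    rows: iterable of (id, fld1, fld2, fld3, fld4, created) tuples.
    Returns a list of (id, is_duplicate) tuples ordered by id.
    All table and column names here are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records ("
        "id INTEGER, fld1 TEXT, fld2 TEXT, fld3 TEXT, fld4 TEXT, "
        "created INTEGER)"
    )
    conn.executemany("INSERT INTO records VALUES (?, ?, ?, ?, ?, ?)", rows)

    # No UPDATE is used: the flagged copy goes into a new table.
    # The subquery finds the earliest created time per key group, and the
    # inner join compares each row against that minimum.
    # Caveats: rows with NULL in any key field do not match the join and
    # are dropped, and rows that tie on the earliest created time are all
    # treated as "first".
    conn.execute(
        """
        CREATE TABLE records_marked AS
        SELECT r.id, r.fld1, r.fld2, r.fld3, r.fld4, r.created,
               CASE WHEN r.created = f.first_created THEN 0 ELSE 1 END
                   AS is_duplicate
        FROM records r
        JOIN (
            SELECT fld1, fld2, fld3, fld4, MIN(created) AS first_created
            FROM records
            GROUP BY fld1, fld2, fld3, fld4
        ) f
          ON r.fld1 = f.fld1 AND r.fld2 = f.fld2
         AND r.fld3 = f.fld3 AND r.fld4 = f.fld4
        """
    )
    result = conn.execute(
        "SELECT id, is_duplicate FROM records_marked ORDER BY id"
    ).fetchall()
    conn.close()
    return result
```

The query uses only an equality join, GROUP BY, MIN and CASE, so the same
shape should in principle carry over to a Hive query that writes into a new
table, but that part is an assumption.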

What's the best way to do this using Hive? Looking forward to hearing your
suggestions.

Thanks
