Re: Identifying and Marking records as duplicates

Himanish Kushary Tue, 21 Aug 2012 10:30:00 -0700

Thanks everyone for the different possible solutions.We resolved this by
using the rank function and following the instructions at
https://github.com/edwardcapriolo/hive-rank


This is similar to Bob's approach.

Thanks

On Mon, Aug 20, 2012 at 3:04 AM, Navis류승우 <navis....@nexr.com> wrote:

> We've solved similar cases by using UDAF + UDTF, something like..
>
> select assign(unique(a, b, c, d)) as (a, b, c, d), e, f from A group by a,
> b, c, d;
>
> unique(UDAF) returns struct and assign(UDTF) reassigns names for the
> struct, where you can append a column for marking.
>
> HIVE-2608 may make this shorter.
>
>
> 2012/8/18 Himanish Kushary <himan...@gmail.com>
>
>> We want to separate them out into different tables for auditing
>> purposes, so i was thinking of marking them and then
>> separating them out based on the "duplicate" flag.
>>
>> Is there a better way of splitting them into 2 tables ? Also how would I
>> pick up the first from the group and put into a different table
>> once I have run the cross join on fld1 thru fld4 with order by timestamp?
>>
>> - Himanish
>>
>> On Fri, Aug 17, 2012 at 11:54 AM, Bertrand Dechoux <decho...@gmail.com>wrote:
>>
>>> If I understand correctly :
>>> A cross join on fld1 thru fld4 with order by timestamp and take the
>>> first of each group.
>>> would allow you to have all the unique and original copies.
>>>
>>> But indeed if you don't have something like HBase, you can not mark them
>>> really but you can select them and write them elsewhere.
>>> Why do you specifically want to mark them?
>>>
>>> Bertrand
>>>
>>>
>>> On Fri, Aug 17, 2012 at 5:46 PM, Bob Gause <bob.ga...@zyquest.com>wrote:
>>>
>>>> We use com.facebook.hive.udf.UDFNumberRows to do a ranking by time in
>>>> some of our queries. You could do that, and then do another select where
>>>> the row number/rank is 1 to get all the "unique" rows.
>>>>
>>>> There are probably a bunch of other ways to do this, but this is the
>>>> one that first came to mind for me….
>>>>
>>>> Enjoy!
>>>> Bob
>>>>
>>>> Robert Gause
>>>> Senior Systems Engineer
>>>> ZyQuest, Inc.
>>>> bob.ga...@zyquest.com
>>>>
>>>> On Aug 17, 2012, at 9:49 AM, Himanish Kushary wrote:
>>>>
>>>> > Hi,
>>>> >
>>>> > We have a huge table which may have duplicate records.A record is
>>>> considered duplicate based on 4 fields ( fld1 thru fld4) . We need to
>>>> identify the duplicate records and possibly mark the duplicates(except the
>>>> first record based on created time for a record).
>>>> >
>>>> > Is this something that could be done by hive or we need to write
>>>> custom M/R for this.Could a inner join or a select with group by be used to
>>>> find the duplicates ? How do I mark the duplicate records as there is no
>>>> update.
>>>> >
>>>> > Whats the best way to do this using Hive ? Looking forward to hear
>>>> the suggestions.
>>>> >
>>>> > Thanks
>>>>
>>>>
>>>
>>>
>>> --
>>> Bertrand Dechoux
>>>
>>
>>
>>
>> --
>> Thanks & Regards
>> Himanish
>>
>
>


-- 
Thanks & Regards
Himanish

Re: Identifying and Marking records as duplicates

Reply via email to