What's the best way to find the nearest neighbor in Hive? Any windowing function?

Mobius ReX Tue, 13 Sep 2016 12:47:50 -0700

Given a table

>     $cat data.csv
>
>     ID,State,City,Price,Number,Flag
>     1,CA,A,100,1000,0
>     2,CA,A,96,1010,1
>     3,CA,A,195,1010,1
>     4,NY,B,124,2000,0
>     5,NY,B,128,2001,1
>     6,NY,C,24,30000,0
>     7,NY,C,27,30100,1
>     8,NY,C,29,30200,0
>     9,NY,C,39,33000,1



Expected Result:

    ID0, ID1
    1,2
    4,5
    6,7
    8,7

for each ID with Flag=0 above, we want to find another ID from Flag=1, with
the same "State" and "City", and the nearest Price and Number normalized by
the corresponding values of that ID with Flag=0.

For example, ID = 1 and ID=2, has the same State and City, but different
FLAG.
After normalized the Price and Number (Price divided by 100, Number divided
by 1000), the distance between ID=1 and ID=2 is:
abs(100/100 - 96/100) + abs(1000/1000 - 1010/1000) = 0.04 + 0.01 = 0.05


What's the best way to find such nearest neighbor in Hive? Can we use
Lead/Lag or Rank for this case? Any valuable tips will be greatly
appreciated!

What's the best way to find the nearest neighbor in Hive? Any windowing function?

Reply via email to