Hi Sean,

Now let's assume columns C and D are already normalized, so the metric simplifies to

abs(C1 - C2) + abs(D1 - D2)

Can we get a performance benefit from LSH in that case?
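To make the question concrete, here's the kind of LSH-style bucketing I have in mind, as a rough, untested Scala sketch (the bucket width w, the toy rows, and all the names are placeholders of mine). Since the metric is plain L1 distance on two normalized columns, we can snap each (C, D) pair onto a grid of cell width w; any pair within L1 distance w then lands in the same or an adjacent cell, so an equi-join on cell coordinates gives a small candidate set to score exactly:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("l1-grid-lsh").master("local[*]").getOrCreate()
import spark.implicits._

val w = 0.1 // candidate radius: pairs with L1 distance <= w land in touching cells

// Toy rows (id, c, d, flag), with c and d already normalized.
val pts = Seq(
  (1L, 1.00, 1.00, 0), (2L, 0.96, 1.01, 1), (3L, 1.95, 1.01, 1)
).toDF("id", "c", "d", "flag")

// Grid-cell coordinates of each point: the "hash".
val cells = pts
  .withColumn("cx", floor($"c" / w))
  .withColumn("cy", floor($"d" / w))

// Replicate each flag=1 point into its own cell plus the 8 neighbors, so a
// plain equi-join on (cx, cy) finds every candidate within radius w.
val offsets = Seq(-1, 0, 1).flatMap(dx => Seq(-1, 0, 1).map(dy => (dx, dy)))
  .toDF("dx", "dy")
val ones = cells.filter($"flag" === 1)
  .crossJoin(offsets)
  .select($"id".as("id1"), $"c".as("c1"), $"d".as("d1"),
          ($"cx" + $"dx").as("cx"), ($"cy" + $"dy").as("cy"))
val zeros = cells.filter($"flag" === 0)
  .select($"id".as("id0"), $"c".as("c0"), $"d".as("d0"), $"cx", $"cy")

// Score only the candidate pairs with the exact metric; keep the nearest
// flag=1 neighbor per flag=0 id (min over a struct orders by dist first).
val nearest = zeros.join(ones, Seq("cx", "cy"))
  .withColumn("dist", abs($"c0" - $"c1") + abs($"d0" - $"d1"))
  .groupBy($"id0")
  .agg(min(struct($"dist", $"id1")).as("best"))
  .select($"id0", $"best.id1".as("id1"), $"best.dist".as("dist"))

nearest.show()

The catch is that any flag=0 point with no flag=1 candidate within w drops out, so w has to be grown (or those rows handled by brute force) until everything gets a match, which I suppose is the accuracy-for-speed trade you mentioned.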
Thank you again!

Best,
Rex

On Tue, Sep 13, 2016 at 12:47 PM, Sean Owen <so...@cloudera.com> wrote:
> Given the nature of your metric, I don't think you can use things like
> LSH, which more or less depends on a continuous metric space. This is
> too specific to fit into a general framework usefully, I think, but I
> think you can solve this directly with some code without much trouble.
>
> On Tue, Sep 13, 2016 at 8:45 PM, Mobius ReX <aoi...@gmail.com> wrote:
> > Hi Sean,
> >
> > Great!
> >
> > Is there any sample code implementing Locality Sensitive Hashing with
> > Spark, in either Scala or Python?
> >
> > "However if your rule is really like "must match column A and B and
> > then closest value in column C" then just ordering everything by A, B,
> > C lets you pretty much read off the answer from the result set
> > directly. Everything is closest to one of its two neighbors."
> >
> > This is interesting, since we can use the Lead/Lag windowing functions
> > if we have only one continuous column. However, our rule is "must
> > match column A and B, and then the closest values in columns C and D -
> > for any ID with column E = 0, find the closest ID with column E = 1".
> > The distance metric between ID1 (with column E = 0) and ID2 (with
> > column E = 1) is defined as
> > abs( C1/C1 - C2/C1 ) + abs( D1/D1 - D2/D1 )
> > One cannot do
> > abs( (C1/C1 + D1/D1) - (C2/C1 + D2/D1) )
> >
> > Any further tips?
> >
> > Best,
> > Rex
> >
> > On Tue, Sep 13, 2016 at 11:09 AM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> The key is really to specify the distance metric that defines
> >> "closeness" for you. You have features that aren't on the same scale,
> >> and some that aren't continuous. You might look to clustering for
> >> ideas here, though mostly you just want to normalize the scale of
> >> dimensions to make them comparable.
> >>
> >> You can find nearest neighbors by brute force. If speed really
> >> matters, you can consider locality sensitive hashing, which isn't that
> >> hard to implement and can give a lot of speed for a small cost in
> >> accuracy.
> >>
> >> However, if your rule is really like "must match column A and B and
> >> then closest value in column C", then just ordering everything by A,
> >> B, C lets you pretty much read off the answer from the result set
> >> directly. Everything is closest to one of its two neighbors.
> >>
> >> On Tue, Sep 13, 2016 at 6:18 PM, Mobius ReX <aoi...@gmail.com> wrote:
> >> > Given a table
> >> >
> >> >> $cat data.csv
> >> >>
> >> >> ID,State,City,Price,Number,Flag
> >> >> 1,CA,A,100,1000,0
> >> >> 2,CA,A,96,1010,1
> >> >> 3,CA,A,195,1010,1
> >> >> 4,NY,B,124,2000,0
> >> >> 5,NY,B,128,2001,1
> >> >> 6,NY,C,24,30000,0
> >> >> 7,NY,C,27,30100,1
> >> >> 8,NY,C,29,30200,0
> >> >> 9,NY,C,39,33000,1
> >> >
> >> > Expected result:
> >> >
> >> > ID0, ID1
> >> > 1,2
> >> > 4,5
> >> > 6,7
> >> > 8,7
> >> >
> >> > For each ID with Flag=0 above, we want to find another ID with
> >> > Flag=1, with the same "State" and "City", and the nearest Price and
> >> > Number, normalized by the corresponding values of the ID with Flag=0.
> >> >
> >> > For example, ID=1 and ID=2 have the same State and City, but a
> >> > different Flag.
> >> > After normalizing the Price and Number (Price divided by 100,
> >> > Number divided by 1000), the distance between ID=1 and ID=2 is:
> >> > abs(100/100 - 96/100) + abs(1000/1000 - 1010/1000) = 0.04 + 0.01 = 0.05
> >> >
> >> > What's the best way to find such a nearest neighbor? Any valuable
> >> > tips will be greatly appreciated!
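P.S. In case it helps anyone else on the list, here's the direct brute-force version I sketched from your earlier suggestion (again untested; the column names come from my toy data above): join the Flag=0 rows to the Flag=1 rows on State and City, compute the normalized distance, and keep the minimum per ID.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("nn-bruteforce").master("local[*]").getOrCreate()
import spark.implicits._

val df = spark.read.option("header", "true").option("inferSchema", "true").csv("data.csv")

val zeros = df.filter($"Flag" === 0)
  .select($"ID".as("id0"), $"State", $"City", $"Price".as("p0"), $"Number".as("n0"))
val ones = df.filter($"Flag" === 1)
  .select($"ID".as("id1"), $"State", $"City", $"Price".as("p1"), $"Number".as("n1"))

// Exact metric as defined above, normalized by the Flag=0 row's own values:
// abs(p0/p0 - p1/p0) + abs(n0/n0 - n1/n0)
val pairs = zeros.join(ones, Seq("State", "City"))
  .withColumn("dist",
    abs($"p0" / $"p0" - $"p1" / $"p0") + abs($"n0" / $"n0" - $"n1" / $"n0"))

// Nearest Flag=1 neighbor per Flag=0 ID: min over a struct sorts by dist
// first, then id1 as a tiebreaker.
val result = pairs
  .groupBy($"id0")
  .agg(min(struct($"dist", $"id1")).as("best"))
  .select($"id0".as("ID0"), $"best.id1".as("ID1"))

result.orderBy($"ID0").show()
// If I've done the arithmetic right, this prints the expected pairs:
// (1,2), (4,5), (6,7), (8,7).

This is quadratic within each (State, City) group, but since the join only pairs rows inside a group, it should stay tractable as long as no single group is huge.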