Hi Sean,

Now let's assume columns C and D are normalized, so the metric simplifies to
abs(C1 - C2) + abs(D1 - D2)

Can we get a performance benefit from LSH in this case?
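
If so, would grid-style bucketing along the lines of the sketch below be a
reasonable way to apply it? This is a rough, untested sketch: the column
names (id, a, b, c, d, e standing for ID, State, City, Price, Number, Flag)
and the bucket width are made up, and c and d are assumed to be normalized
already.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Bucket width: any pair within L1 distance w of each other lands in
// the same grid cell or an adjacent one.
val w = 0.1

val bucketed = df
  .withColumn("bc", floor(col("c") / w))
  .withColumn("bd", floor(col("d") / w))

val zeros = bucketed.filter(col("e") === 0).alias("l")
val ones  = bucketed.filter(col("e") === 1).alias("r")

// Candidate pairs: same A and B, and same or adjacent grid cell.
// Probing adjacent cells avoids missing near pairs that straddle a
// cell boundary.
val candidates = zeros.join(ones,
  col("l.a") === col("r.a") &&
  col("l.b") === col("r.b") &&
  abs(col("l.bc") - col("r.bc")) <= 1 &&
  abs(col("l.bd") - col("r.bd")) <= 1)

// Compute the exact L1 distance only on surviving candidates, then
// keep the nearest Flag=1 row per Flag=0 row.
val best = candidates
  .withColumn("dist",
    abs(col("l.c") - col("r.c")) + abs(col("l.d") - col("r.d")))
  .withColumn("rank", row_number().over(
    Window.partitionBy(col("l.id")).orderBy(col("dist"))))
  .filter(col("rank") === 1)
  .select(col("l.id").as("id0"), col("r.id").as("id1"), col("dist"))

The brute-force version would be the same join without the two bucket
conditions; the buckets only prune candidate pairs, so a Flag=0 row with no
Flag=1 row within w in both dimensions would get no match and would need a
fallback pass.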

Thank you again!

Best,
Rex
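
P.S. To make the one-column case concrete, this is the lead/lag-style
sort-and-scan I mention below: within each (A, B) group, the nearest Flag=1
row to a Flag=0 row is either the closest Flag=1 row before it or after it
in C order. Again an untested sketch with the same made-up column names:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val byC   = Window.partitionBy("a", "b").orderBy("c")
val prevW = byC.rowsBetween(Window.unboundedPreceding, -1)
val nextW = byC.rowsBetween(1, Window.unboundedFollowing)

// Carry (c, id) of Flag=1 rows as a struct so the neighbor's id
// survives the window aggregation; Flag=0 rows contribute null.
val flag1 = when(col("e") === 1, struct(col("c").as("c"), col("id").as("id")))

val nearest = df
  .withColumn("prev1", last(flag1, ignoreNulls = true).over(prevW))
  .withColumn("next1", first(flag1, ignoreNulls = true).over(nextW))
  .filter(col("e") === 0)
  .withColumn("id1",
    when(col("next1").isNull, col("prev1.id"))
      .when(col("prev1").isNull, col("next1.id"))
      .when(abs(col("c") - col("prev1.c")) <= abs(col("c") - col("next1.c")),
            col("prev1.id"))
      .otherwise(col("next1.id")))
  .select(col("id").as("id0"), col("id1"))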


On Tue, Sep 13, 2016 at 12:47 PM, Sean Owen <so...@cloudera.com> wrote:

> Given the nature of your metric, I don't think you can use things like
> LSH which more or less depend on a continuous metric space. This is
> too specific to fit into a general framework usefully I think, but, I
> think you can solve this directly with some code without much trouble.
>
> On Tue, Sep 13, 2016 at 8:45 PM, Mobius ReX <aoi...@gmail.com> wrote:
> > Hi Sean,
> >
> > Great!
> >
> > Is there any sample code implementing Locality Sensitive Hashing with
> > Spark, in either Scala or Python?
> >
> > "However if your rule is really like "must match column A and B and
> > then closest value in column C then just ordering everything by A, B,
> > C lets you pretty much read off the answer from the result set
> > directly. Everything is closest to one of its two neighbors."
> >
> > This is interesting, since we can use the lead/lag window functions when
> > there is only one continuous column. However, our rule is "must match
> > column A and B, and then the closest values in columns C and D - for any
> > ID with column E = 0, find the closest ID with column E = 1".
> > The distance metric between ID1 (with column E = 0) and ID2 (with column
> > E = 1) is defined as
> > abs(C1/C1 - C2/C1) + abs(D1/D1 - D2/D1)
> > and one cannot simply collapse this to
> > abs((C1/C1 + D1/D1) - (C2/C1 + D2/D1))
> >
> >
> > Any further tips?
> >
> > Best,
> > Rex
> >
> >
> >
> > On Tue, Sep 13, 2016 at 11:09 AM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> The key is really to specify the distance metric that defines
> >> "closeness" for you. You have features that aren't on the same scale,
> >> and some that aren't continuous. You might look to clustering for
> >> ideas here, though mostly you just want to normalize the scale of
> >> dimensions to make them comparable.
> >>
> >> You can find nearest neighbors by brute force. If speed really matters
> >> you can consider locality sensitive hashing, which isn't that hard to
> >> implement and can give a lot of speed for a small cost in accuracy.
> >>
> >> However, if your rule is really like "must match column A and B and
> >> then closest value in column C", then just ordering everything by A, B,
> >> C lets you pretty much read off the answer from the result set
> >> directly. Everything is closest to one of its two neighbors.
> >>
> >> On Tue, Sep 13, 2016 at 6:18 PM, Mobius ReX <aoi...@gmail.com> wrote:
> >> > Given a table
> >> >
> >> >>     $cat data.csv
> >> >>
> >> >>     ID,State,City,Price,Number,Flag
> >> >>     1,CA,A,100,1000,0
> >> >>     2,CA,A,96,1010,1
> >> >>     3,CA,A,195,1010,1
> >> >>     4,NY,B,124,2000,0
> >> >>     5,NY,B,128,2001,1
> >> >>     6,NY,C,24,30000,0
> >> >>     7,NY,C,27,30100,1
> >> >>     8,NY,C,29,30200,0
> >> >>     9,NY,C,39,33000,1
> >> >
> >> >
> >> > Expected Result:
> >> >
> >> >     ID0, ID1
> >> >     1,2
> >> >     4,5
> >> >     6,7
> >> >     8,7
> >> >
> >> > For each ID with Flag=0 above, we want to find another ID with Flag=1,
> >> > with the same "State" and "City", and the nearest Price and Number,
> >> > normalized by the corresponding values of the ID with Flag=0.
> >> >
> >> > For example, ID=1 and ID=2 have the same State and City, but different
> >> > Flag. After normalizing Price and Number (Price divided by 100, Number
> >> > divided by 1000), the distance between ID=1 and ID=2 is:
> >> > abs(100/100 - 96/100) + abs(1000/1000 - 1010/1000) = 0.04 + 0.01 = 0.05
> >> >
> >> >
> >> > What's the best way to find such nearest neighbor? Any valuable tips
> >> > will be
> >> > greatly appreciated!
> >> >
> >> >
> >
> >
>
